An apprenticeship model for human and AI collaborative essay grading

Alok Baikadi (alok.baikadi@pearson.com), Lee Becker (lee.becker@pearson.com), Jill Budden (jill.budden@pearson.com), Peter Foltz (peter.foltz@pearson.com), Andrew Gorman (andrew.gorman@pearson.com), Scott Hellman (scott.hellman@pearson.com), William Murray (william.murray@pearson.com), and Mark Rosenstein (mark.rosenstein@pearson.com), Pearson

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI).

KEYWORDS
Human-AI collaboration; Intelligent user interfaces; Machine learning; Active learning; Automated essay grading

ACM Reference Format:
Alok Baikadi, Lee Becker, Jill Budden, Peter Foltz, Andrew Gorman, Scott Hellman, William Murray, and Mark Rosenstein. 2019. An apprenticeship model for human and AI collaborative essay grading. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
Across a range of domains, humans and complex algorithms embedded in systems are collaborating and coordinating action. Tasks are shared because the combined system can be more capable than either agent acting alone. Such systems with shared autonomy raise important research questions in how to design these joint systems to best utilize each of the actors' capabilities for optimal performance while addressing important safety, legal and ethical issues.

Our work investigates students developing their writing ability throughout their educational career, since writing proficiency is a critical life and career competency. Writing is a skill that develops through practice and feedback. However, the massive effort required of instructors in providing feedback on drafts and scoring final versions is a limiting factor in assigning essays and short answer problems in their classes.

Over the last 20 years, automated scoring of student writing through the use of natural language processing, machine learning and artificial intelligence techniques, coupled with human-scored training sets, has in many applications achieved performance comparable to that of humans [19]. In high stakes testing, millions of students have had their writing automatically scored, with the prompts, rubrics and human scoring for training and validating model performance implemented in tightly controlled frameworks. Our work addresses the question of how to move this technology to a wider audience and add formative assessment, to have a broader impact in helping students learn to write.

Our goal is to lower the barriers that limit an instructor's ability to assign essays in their courses. Our approach is to develop a system in which instructors develop prompts appropriate for their course and, by scoring a subset of their student responses, are able to turn on an automated system to score the rest of the responses. These prompts and the ability to automatically score them become a resource that instructors can reuse and even share. The critical issue is that while the instructor is an expert in their domain, they are likely not an expert in either assessment or the machine learning techniques that make automated scoring possible.

We have piloted a prototype system in a large introductory psychology course at a large university. This pilot explored the issues of 1) transferring scoring expertise from an instructor to an automated system and 2) using an automated system to provide feedback to the instructor about the quality of the current state of its scoring.
In most end user applications of AI, the user is only exposed to decisions or behavior from some unseen, unknown model, and training of the machine learning mechanism is either hidden or taken for granted by the user. Exposing machine learning flows to users unversed in the notions of model performance and evaluation raises interesting design questions around user trust, system transparency, and managing expectations.

We discuss the approach used for the pilot and some of the issues that emerged. We then show how adopting a metaphor of apprenticeship clarifies the communication between the instructor and the AI assistant. In this paper we discuss the issues of shared autonomy that arise in such a system and the issues we have seen in defining the task, in making results mutually interpretable to the instructor and the machine, and in designing user interfaces that make more transparent all the various factors that are required for making correct decisions on how the task should proceed.

2 RELIABLE SCORING
Current high stakes scoring implementations attempt to achieve reliability through various means: the use of rubrics to specify the results that must be evidenced at each score point [9, 20]; anchor papers, which are actual example essays selected to indicate typical and boundary score point answers [15]; and supervised scorer training, which often includes practice scoring exercises and required levels of performance on example essays. Yet despite these rigorous preparations, raters still disagree. Psychometricians have developed methods to detect subtle biases in raters, referred to as rater effects [23]. These techniques allow performance monitoring during scoring and detection after scoring.

In complex tasks, such as writing an essay, even with well-trained scorers without rater effects there will be an expected level of disagreement over the score of individual essays. A number of studies have found that with well-designed prompts and rubrics and well-trained scorers, the expected range of adjacent agreement, i.e., scores within 1 point, is 80-99%, with correlations in the 0.7 to 0.8 range (Brown et al. [2] provide a summary of research and standards in this area).
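For concreteness, the following is a minimal sketch (not code from our system) of how these two agreement statistics might be computed for a pair of raters; the function names and example scores are hypothetical.

```python
import numpy as np

def adjacent_agreement(rater_a, rater_b, tolerance=1):
    """Fraction of essays where the two raters' scores differ by at most `tolerance` points."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return float(np.mean(np.abs(a - b) <= tolerance))

def score_correlation(rater_a, rater_b):
    """Pearson correlation between the two raters' scores."""
    return float(np.corrcoef(rater_a, rater_b)[0, 1])

# Illustrative only: hypothetical scores from two raters on a 1-6 rubric.
human = [4, 3, 5, 2, 4, 6, 3, 1, 4, 5]
model = [4, 4, 5, 2, 3, 5, 3, 2, 4, 4]
print(adjacent_agreement(human, model))  # proportion of scores within 1 point
print(score_correlation(human, model))   # compare against the 0.7-0.8 band above
```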
Human scoring is time consuming, expensive and limits the immediacy of feedback that can be provided to the student. Page described the first system to automatically score essays based on analysis of a fairly limited set of features from the essay [16]. Present day automated systems score millions of student essays in both high stakes and formative engagements, with performance levels at or above human scorers (Shermis and Burstein have co-edited comprehensive summaries on the subject [18, 19]). These automated scoring systems are typically based on supervised machine learning, where the system is trained on a set of student essays and human scores. The system derives a set of features for each essay and learns to infer the human score from the features. A sample of essays, typically on the order of 300 to 500, is scored by human scorers and then used to train the automated system. Performance of the automated system is compared to the performance of two human raters and, if found acceptable, the automated system then scores the remaining essays.

In the six years since the most recent survey of the automated scoring field [19], developments in machine learning have influenced the modeling choices in both research and commercial systems. These techniques include hierarchical classification [14], correlated linear regression [17], various neural network deep learning approaches [7, 21] and many others. In addition, some commercial systems have described their modeling subsystems, e.g. [4].

As we move away from high stakes scoring with precisely trained models to formative scoring with instructor-trained models (and the use of automated scoring in the classroom), the burden of generating reliable scores to train the automated system now falls on the instructor. For the automated system to reliably score, the instructor must score a sufficiently large number of essays to capture the variability of student responses, and do so in a sufficiently reliable fashion to allow the regularities of scoring to be learnable by machine learning techniques. In the system we have developed, where the system learns from an instructor's scoring behavior, the instructor need only score enough essays to build a performant model. The hurdle is that, as the instructor scores, the system needs to provide feedback on how well the current model is performing. In an intelligible manner, the system must update the instructor on its progress. The AI system must continually provide information to the instructor to allow the instructor to make an informed decision about the quality of the automated scoring and determine when it is justifiable to turn scoring over to the automated system.

3 SYSTEM DESCRIPTION
We have developed a prototype system which allows instructors to assign writing to their students and participate in AI-assisted grading workflows. The system allows an instructor to create an account, invite students to join a course, and assign writing within a course. The prototype reported on here is an intermediate step toward enabling instructors to write and have their own prompts automatically scored. This step allows us to test the user interface and methods for sharing the task between the instructor and the system. The system learns to modify the scoring of existing, already-modeled prompts to more closely represent an instructor's scoring. In this current iteration, instructors select from a list of available writing prompts, each of which contains a short description, a rubric against which to grade student submissions, and a currently existing automated grading model. Once the prompt has been assigned, students are able to draft and submit their responses.

The collection of student responses goes through an active learning preprocessing step to calculate a recommended ordering for the instructor to grade essays. Active learning is typically employed to reduce human annotation effort, and in our system we use it to minimize the number of human-graded submissions needed for reliable modeling. Within the instructor interface this is simply the list of submissions to grade, sorted by the active learning order. In our current implementation, we use the Kennard-Stone algorithm [10]. Kennard-Stone attempts to select submissions in a manner that uniformly covers the feature space by iteratively selecting the submission with the maximum minimum distance to all previously selected submissions. We use baseline automated scores as our feature space so that the human grader will see approximately the same number of submissions at each score point, despite very low and very high scoring submissions being rare. In other natural language active learning tasks, biasing the active learner in favor of low-frequency classes has been found to work well [8, 22], and Kennard-Stone has been found to perform well for automated grading in particular [6].
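The following is a minimal illustrative sketch of Kennard-Stone max-min selection as just described; it is not our production implementation, and the helper name and toy one-dimensional feature space of baseline scores are ours.

```python
import numpy as np

def kennard_stone_order(features, n_select=None):
    """Order samples by repeatedly picking the point with the maximum
    minimum distance to all previously selected points (Kennard-Stone).
    Sketch only; assumes at least two samples."""
    X = np.asarray(features, dtype=float)
    n = len(X)
    n_select = n if n_select is None else n_select
    # Precompute pairwise distances and seed with the two most distant points.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    selected = [int(i), int(j)]
    remaining = set(range(n)) - set(selected)
    while remaining and len(selected) < n_select:
        rem = list(remaining)
        # For each candidate, its distance to the nearest already-selected point.
        min_d = dists[np.ix_(rem, selected)].min(axis=1)
        pick = rem[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

# Hypothetical 1-D feature space of baseline automated scores.
baseline_scores = np.array([[1.0], [1.1], [3.0], [3.2], [5.9], [6.0]])
print(kennard_stone_order(baseline_scores))  # picks spread across the score range
```

Because selection runs over baseline scores rather than raw text features, the resulting order spreads the instructor's grading effort across score points, including the rare extremes.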
Figure 1: The instructor's grading interface used in our pilot study. Upper Left: AI readiness notification message. Lower Left: Essay prompt details link and student's essay response. Right: Total score, AI training progress bar, and rubric scoring controls with score of 4 for trait Organization.

As the instructor scores submissions, the system begins the modeling phase. In the modeling phase, the machine learning system is trained to mimic the instructor's grades, and its performance is evaluated. Once the system determines the evaluation is acceptable, the instructors are signaled that the training is complete, and they are able to view the automated grades and make adjustments as needed. If the instructor corrects a grade, the system refines the model using the newly graded submissions.

As a first step to understand how instructors interact with AI systems, we decided not to allow instructors direct access to a highly complex, large-parameter-space machine learning model. Instead, instructors assigned prompts for which there already existed a fully trained machine learning model. Our implementation uses logistic regression to learn a model that modifies the pre-trained automated scores to better match the instructor's scoring. The system learned the two parameters of a logistic regression model to estimate the instructor's scores based on the instructor's scores on responses and the scores from the existing model. The logistic regression functions as a transformation over the pre-trained model scores, adjusting the distance between score points to more closely match the instructor's scoring behavior. By learning a transformation over the pre-trained model, we are able to leverage the accuracy of the existing model, while allowing instructors to adjust the scoring to suit their classroom needs.
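The exact parameterization of this two-parameter transformation is not given here; one plausible reading, sketched below purely for illustration under our own assumptions, treats the two learned parameters as the slope and offset of a sigmoid that rescales the pre-trained score into the rubric's range. All names, the assumed score range, and the paired data are hypothetical.

```python
import numpy as np
from scipy.optimize import curve_fit

S_MIN, S_MAX = 1.0, 6.0  # rubric score range (assumed for this sketch)

def calibrated(base, a, b):
    """Two-parameter sigmoid transform of a pre-trained model score
    into the instructor's scoring scale. Illustrative, not the deployed form."""
    return S_MIN + (S_MAX - S_MIN) / (1.0 + np.exp(-(a * base + b)))

# Hypothetical paired data: pre-trained model scores vs. instructor scores.
base_scores = np.array([1.2, 2.0, 2.8, 3.5, 4.1, 4.9, 5.6])
instructor  = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0])

(a, b), _ = curve_fit(calibrated, base_scores, instructor, p0=(1.0, -3.0))
print(np.round(calibrated(base_scores, a, b)))  # adjusted scores at rubric points
```

Fitting only these two parameters, rather than a full scoring model, is what lets a handful of instructor grades reshape the distances between score points while the pre-trained model's accuracy is preserved.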
4 PILOT STUDY
We conducted a pilot with nine instructors and teaching assistants for an Introductory Psychology course at a large university. The participants completed an initial training session where they were presented with an overview of the interface, and were encouraged to practice on a small set of student writing before the start of the pilot. Over the course of the semester, the participants were asked to use our system to grade 100 student submissions for each of nine writing prompts. Student submissions were sampled from the participating instructors' courses and were anonymized. Prompts were assigned in three phases, each consisting of three prompts, and then made available to the participants to score. Upon logging in, the participants were able to begin grading by selecting any of the three available prompts. The prompt was presented on a summary screen (Figure 1). The summary screen also presents the participants with the suggested order in which to grade the submissions (generated by active learning).

In addition to this ordering, the submissions were further divided into two sets: one that must be graded by the human rater, and one that could have AI feedback. As participants graded the first set, a progress bar indicated how close the model was to being trained. They could refer to the summary screen to evaluate their progress at any time. Once the predetermined threshold was reached, the participants received a "Hooray" message, as in Figure 1, and they were then able to review the output of the automated scoring model, adjusted by the logistic regression. They would then review several autoscored responses and adjust the scores as needed. The regression model would be recalculated after each further adjustment. Once the participants were satisfied with the performance of the model, they were able to finalize the scoring and fix the grades for the rest of the submissions to that prompt.

5 APPRENTICESHIP MODEL OF TRAINING
The system used in the pilot employed a progress bar with an alert message to communicate the current level of scoring performance to the participants. The interface and interaction implicitly encouraged instructors to regard the system as a tool and to think of system state as bimodal: untrained or trained. The disadvantage of this approach is that it encouraged instructors to infer that after the transition from untrained to trained the tool's performance matched their own, but technically it meant only that a fuzzy threshold had been passed, and further monitoring and feedback of automated scoring were still required. This mismatch likely caused participants to be less vigilant, while survey results indicated participants felt disappointed when they had to correct the nascent automatic scoring. The message "Hooray! The scoring assistant is now calibrated . . . " and the green color of the progress bar implicitly set incorrect expectations and discouraged participants from carrying out further review and revision of scores, other than minimal tests to satisfy themselves that the model was performant. Additionally, during pre-pilot instruction, we suggested that the participants review approximately twenty submissions after the autoscoring model was enabled. Participants rarely strayed from these guidelines, reviewing approximately twenty submissions on average. In post-surveys, participant responses indicated that they did not have a strong sense of when to stop reviewing. Many would grade until the automated scores for a single essay matched their expectations.

Analyses of behavioral data, survey results and users' feedback motivated us to reevaluate our user experience design to better scaffold the user through the process of training and to better communicate the expected quality of the automated scoring model.

While apprenticeship has been a model of human skill building for millennia, Lave was among the first to study and describe it as a formal mode of learning [12]. Collins et al. further generalized Lave's observation into what we refer to as an apprenticeship model of training [5]. This pedagogy-oriented paradigm consists of multiple phases, where the first three are relevant for our application: modeling, coaching, and fading. In modeling, the apprentice (learner) "repeatedly observes the expert performing the target process". During coaching, the apprentice "attempts to execute the process while the expert provides guidance and scaffolds feedback and instruction". Lastly, in fading, the expert provides less feedback and eventually ascertains the apprentice's mastery.
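Purely as an illustration of how these three phases can map onto system state, the sketch below encodes them with one possible transition rule; the thresholds are invented for the example and are not the system's actual criteria.

```python
from enum import Enum

class Phase(Enum):
    MODELING = "modeling"  # instructor scores; the assistant observes
    COACHING = "coaching"  # assistant scores; instructor corrects
    FADING = "fading"      # assistant scores; instructor spot-checks

def current_phase(n_scored, agreement, min_scored=20, target_agreement=0.75):
    """Pick the workflow phase from how much the instructor has scored and
    how well the assistant currently agrees with those scores.
    Thresholds here are illustrative, not the system's actual values."""
    if n_scored < min_scored:
        return Phase.MODELING
    if agreement < target_agreement:
        return Phase.COACHING
    return Phase.FADING
```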
This paradigm has been employed in computer supported collaborative learning (CSCL) and intelligent tutoring systems (ITS), where the system regards the user as an apprentice to help them develop new skills [3, 11, 13].

The apprenticeship model provides a useful metaphor for aligning our system's three stages of data gathering and application with an accessible, real-world process. Our system swaps the human-computer relationship typical of ITS and instead considers the user the expert and the AI-assistant persona the apprentice. By adopting this framework of model training as task modeling, model tuning and validation as coaching, and model acceptance as fading, we help the user to better understand the expected interactions and responsibilities.

Our coupling of apprenticeship with machine learning is distinct from the use of apprenticeship in reinforcement learning [1], which does not have an interactive human element.

6 REDESIGNED USER INTERFACE AND FLOW
Based on the feedback from our initial pilot, in our new three-phase apprenticeship approach we encourage instructors to view the process of training the automated scoring system as an apprenticeship. In this view, they can reasonably expect the AI-assistant to continue to learn even after it starts grading. The instructor now expects mistakes to continue, but in diminishing number and severity over time. With this approach, minor mistakes are less likely to damage trust in the system.

In the redesigned UI, at the top of each screen a large circle indicates the current location in the process, with messages updating to keep the instructor informed of progress within a given phase. At the beginning of the apprentice modeling phase (Figure 2), instructors are encouraged to score essays to help train the AI grading assistant. As they score more essays, they see progress updates as shown in Figure 3.

Figure 2: Step 1a - Model training (apprenticeship modeling): UI indicates instructor must explicitly train the AI grading assistant. This is the initial messaging.

Figure 3: Step 1b - Model training (apprenticeship modeling): UI indicates current progress in the training phase.

When the AI grading assistant is first able to begin scoring, instructors move into the coaching phase (Figure 4). In this phase the instructor scores and then compares their score to the AI-assistant's (Figure 5). This phase ends when the system gains sufficient confidence in the model's performance (e.g., correlations of 0.7 to 0.8, values similar to human inter-rater reliability). This is reflected in the instructor's view by showing the instructor's corrections diminish (Figure 6).

Figure 4: Step 2a - Model tuning/validation (apprenticeship coaching): Messaging informs instructor AI grading assistant is ready to begin scoring essays.

Figure 5: Step 2b - Model tuning/validation (apprenticeship coaching): Evaluation metrics inform instructor of AI grading assistant's performance and encourage the instructor to continue grading.

Figure 6: Step 2c - Model tuning/validation (apprenticeship coaching): Messaging informs instructor AI grading assistant might be able to take over scoring essays.

Once in the fading phase (Figure 7), the instructor passes scoring to the AI-assistant, but still retains the ability to review the AI-assistant's scores. Messages reinforce the relationship between additional scoring and performance, making the apprentice relationship of the assistant (e.g., ". . . the more you grade, the more the assistant learns from you") more transparent. The level of the assistant's learning is indicated by the number and percentage of agreements compared to disagreements with the instructor's grades, and by the progress bar. The progress indicated by the bar follows the number of essays scored, but can accelerate as model performance improves.
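As a rough illustration of this accelerating behavior (the deployed formula is not shown here, and the names and constants below are ours), a progress value can track the count of scored essays while letting higher reliability pull the bar forward:

```python
def progress(n_scored, n_target, reliability, target_reliability=0.75):
    """Progress-bar value in [0, 1]: follows the number of essays scored, but
    accelerates as the assistant's reliability approaches the target.
    A sketch of the behavior described above, not the deployed formula."""
    count_part = min(n_scored / n_target, 1.0)
    perf_part = min(max(reliability, 0.0) / target_reliability, 1.0)
    # Shrinking the exponent as reliability rises lets improving agreement
    # move the bar faster than raw essay counts alone.
    return max(count_part, count_part ** (1.0 - 0.5 * perf_part))

# With half the essays scored, higher reliability yields a fuller bar:
print(progress(50, 100, 0.2), progress(50, 100, 0.75))
```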
Figure 7: Step 3 - Model acceptance (apprenticeship fading): Instructor has handed over control to AI grading assistant and is encouraged to review scores on remaining essays.

7 CONCLUSIONS AND FUTURE WORK
As more people interact with systems based on sophisticated, often opaque algorithms, it becomes ever more critical to develop common languages and appropriate metaphors to allow communication and common understanding. Often, as these systems move to increasingly common use, a more refined understanding of how the machine learning component is trained and what its limitations are becomes lost. In our first pilot we adopted a quite reasonable view of training an automated scoring system as a tool. Our first set of instructors internalized this model, with unexpected consequences both for their performance on the task and their satisfaction with completing the task. In moving to the apprentice model, we believe we have found a metaphor that ameliorates some of these issues.

Our next steps include conducting pilots with this new metaphor and a UI/UX that supports it. We have begun to think more broadly about the complex relationships between clever systems and equally clever people, both of which have large blind spots. The instructors know the domain area but may have less experience in the type of reliable scoring required to train an automated scoring model. The AI system embodies knowledge about scoring that can be used to scaffold the instructor's scoring, but at the same time is an apprentice to how the instructor wants the prompt evaluated. How to share the task and how the two agents should communicate are interesting, open questions. These questions will become even more relevant as we begin testing the complete system, which will include instructors authoring prompts and will replace logistic regression with a full modeling pipeline.
REFERENCES
[1] P. Abbeel and A. Ng. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the 21st International Conference on Machine Learning.
[2] Gavin T. L. Brown, Kath Glasswell, and Don Harland. [n. d.]. Accuracy in the scoring of writing: Studies of reliability and validity using a New Zealand writing assessment system. Assessing Writing 9, 2 ([n. d.]), 105–121.
[3] John Seely Brown, R. R. Burton, and A. G. Bell. 1975. SOPHIE: A step toward creating a reactive learning environment. International Journal of Man-Machine Studies 7, 5 (Sept. 1975), 675–696.
[4] Jing Chen, James H. Fife, Isaac I. Bejar, and André A. Rupp. 2016. Building e-rater® Scoring Models Using Machine Learning Methods. ETS Research Report Series 2016, 1 (2016), 1–12.
[5] A. Collins, J. S. Brown, and S. E. Newman. 1987. Cognitive apprenticeship: Teaching the craft of reading, writing and mathematics. Technical Report 403. Centre for the Study of Reading, University of Illinois; BBN Laboratories, Cambridge, MA.
[6] Nicholas Dronen, Peter W. Foltz, and Kyle Habermehl. 2015. Effective Sampling for Large-scale Automated Writing Evaluation Systems. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale. 3–10. https://doi.org/10.1145/2724660.2724661
[7] Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. arXiv preprint arXiv:1804.06898 (2018).
[8] Andrea Horbach and Alexis Palmer. 2016. Investigating Active Learning for Short-Answer Scoring. In BEA@NAACL-HLT. 301–311.
[9] Anders Jonsson and Gunilla Svingby. 2007. The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review 2, 2 (2007), 130–144.
[10] Ronald W. Kennard and Larry A. Stone. 1969. Computer Aided Design of Experiments. Technometrics 11, 1 (1969), 137–148.
[11] S. P. Lajoie and A. M. Lesgold. 1992. Dynamic assessment of proficiency for solving procedural knowledge tasks. Educational Psychologist 27 (1992), 365–384.
[12] J. Lave. [n. d.]. Tailored learning: Education and everyday practice among craftsmen in West Africa. Technical Report.
[13] A. Lesgold, G. Eggan, and G. Rao. 1992. Possibilities for assessment using computer-based apprenticeship environments. In Cognitive Approaches to Automated Instruction, W. Regian and V. Shute (Eds.). (1992), 49–80.
[14] Danielle S. McNamara, Scott A. Crossley, Rod D. Roscoe, Laura K. Allen, and Jianmin Dai. 2015. A hierarchical classification approach to automated essay scoring. Assessing Writing 23 (2015), 35–59.
[15] Miles Myers. 1980. A procedure for writing assessment and holistic scoring. National Council of Teachers of English, Urbana, IL.
[16] Ellis B. Page. 1967. Statistical and linguistic strategies in the computer grading of essays. In Coling 1967: Conférence Internationale sur le Traitement Automatique des Langues, Grenoble, France.
[17] Peter Phandi, Kian Ming A. Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
[18] Mark D. Shermis and Jill C. Burstein (Eds.). 2003. Automated essay scoring: A cross-disciplinary perspective. Lawrence Erlbaum Associates, Mahwah, NJ.
[19] Mark D. Shermis and Jill C. Burstein (Eds.). 2013. Handbook of automated essay evaluation: Current applications and new directions. Routledge, New York.
[20] Dannelle D. Stevens and Antonia J. Levi. 2013. Introduction to Rubrics: An Assessment Tool to Save Grading Time, Convey Effective Feedback, and Promote Student Learning. Stylus Publishing, LLC.
[21] Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
[22] Katrin Tomanek and Udo Hahn. 2009. Reducing Class Imbalance During Active Learning for Named Entity Annotation. In Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP '09). ACM, New York, NY, USA, 105–112. https://doi.org/10.1145/1597735.1597754
[23] Edward W. Wolfe. 2004. Identifying rater effects using latent trait models. Psychology Science 46 (2004), 35–51.