               An apprenticeship model for human and AI
                      collaborative essay grading
                  Alok Baikadi                                       Lee Becker                                Jill Budden
         alok.baikadi@pearson.com                           lee.becker@pearson.com                     jill.budden@pearson.com
                     Pearson                                           Pearson                                    Pearson
                   Peter Foltz                                   Andrew Gorman                                Scott Hellman
          peter.foltz@pearson.com                         andrew.gorman@pearson.com                   scott.hellman@pearson.com
                     Pearson                                           Pearson                                    Pearson
                                        William Murray                                Mark Rosenstein
                                william.murray@pearson.com                       mark.rosenstein@pearson.com
                                              Pearson                                      Pearson
CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI).

KEYWORDS
Human-AI collaboration; Intelligent user interfaces; Machine learning; Active learning; Automated essay grading

ACM Reference Format:
Alok Baikadi, Lee Becker, Jill Budden, Peter Foltz, Andrew Gorman, Scott Hellman, William Murray, and Mark Rosenstein. 2019. An apprenticeship model for human and AI collaborative essay grading. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

IUI Workshops'19, March 20, 2019, Los Angeles, USA
© 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1   INTRODUCTION
Across a range of domains, humans and complex algorithms embedded in systems are collaborating and coordinating action. Tasks are shared because the combined system can be more capable than either agent acting alone. Such systems with shared autonomy raise important research questions about how to design these joint systems to best utilize each actor's capabilities for optimal performance while addressing important safety, legal and ethical issues.

Our work investigates students developing their writing ability throughout their educational career, since writing proficiency is a critical life and career competency. Writing is a skill that develops through practice and feedback. However, the massive effort required of instructors in providing feedback on drafts and scoring final versions is a limiting factor in assigning essays and short answer problems in their classes.

Over the last 20 years, automated scoring of student writing through the use of natural language processing, machine learning and artificial intelligence techniques coupled with human-scored training sets has in many applications achieved performance comparable to that of humans [19]. In high stakes testing, millions of students have had their writing automatically scored, with the prompts, rubrics and human scoring for training and validating model performance implemented in tightly controlled frameworks. Our work addresses the question of how to move this technology to a wider audience and add formative assessment, so as to have a broader impact in helping students learn to write.

Our goal is to lower the barriers that limit an instructor's ability to assign essays in their courses. Our approach is to develop a system in which instructors develop prompts appropriate for their course and, by scoring a subset of their student responses, are able to turn on an automated system to score the rest of the responses. These prompts and the ability to automatically score them become a resource that instructors can reuse and even share. The critical issue is that while the instructor is an expert in their domain, they are likely not an expert in either assessment or the machine learning techniques that make automated scoring possible.

We have piloted a prototype system in a large introductory psychology course at a large university. This pilot explored the issues of 1) transferring scoring expertise from an instructor to an automated system and 2) using an automated system to provide feedback to the instructor about the quality of the current state of its scoring. In most end user applications of AI, the user is only exposed to decisions or behavior from some unseen, unknown model, and training of the machine learning mechanism is either hidden or taken for granted by the user. Exposing machine learning flows to users unversed in the notions of model performance and evaluation raises interesting design questions around user trust, system transparency, and managing expectations.
We discuss the approach used for the pilot and some of the issues that emerged. We then show how adopting a metaphor of apprenticeship clarifies the communication between the instructor and the AI assistant. In this paper we discuss the issues of shared autonomy that arise in such a system and the issues we have seen in defining the task, in making results mutually interpretable to the instructor and the machine, and in designing user interfaces that make transparent all the various factors required for making correct decisions about how the task should proceed.

2   RELIABLE SCORING
Current high stakes scoring implementations attempt to achieve reliability through various means: the use of rubrics to specify the results that must be evidenced at each score point [9, 20]; anchor papers, which are actual example essays selected to indicate typical and boundary score point answers [15]; and supervised scorer training, which often includes practice scoring exercises and required levels of performance on example essays. Yet despite these rigorous preparations, raters still disagree. Psychometricians have developed methods to detect subtle biases in raters, referred to as rater effects [23]. These techniques allow performance monitoring during scoring and detection after scoring.

In complex tasks, such as writing an essay, even with well-trained scorers free of rater effects there will be an expected level of disagreement over the scores of individual essays. A number of studies have found that with well-designed prompts and rubrics and well-trained scorers, the expected range of adjacent agreement, i.e., scores within 1 point, is 80-99%, with correlations in the 0.7 to 0.8 range (Brown et al. [2] provide a summary of research and standards in this area).
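To make these reliability measures concrete, the sketch below (our illustration, not code from any scoring system described here) computes exact agreement, adjacent agreement within one score point, and the Pearson correlation for two raters who scored the same essays; the example scores are hypothetical.

import numpy as np

def rater_agreement(scores_a, scores_b):
    """Reliability measures for two raters scoring the same essays."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    exact = np.mean(a == b)                 # identical scores
    adjacent = np.mean(np.abs(a - b) <= 1)  # within one score point
    r = np.corrcoef(a, b)[0, 1]             # Pearson correlation
    return exact, adjacent, r

# Hypothetical scores on a 0-4 rubric; well-trained raters typically
# reach 80-99% adjacent agreement and correlations near 0.7-0.8.
exact, adjacent, r = rater_agreement([4, 3, 2, 4, 1], [4, 2, 2, 3, 1])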
Human scoring is time consuming, expensive and limits the immediacy of feedback that can be provided to the student. Page described the first system to automatically score essays based on analysis of a fairly limited set of features from the essay [16]. Present day automated systems score millions of student essays in both high stakes and formative engagements with performance levels at or above human scorers (Shermis and Burstein have co-edited comprehensive summaries on the subject [18, 19]). These automated scoring systems are typically based on supervised machine learning, where the system is trained on a set of student essays and human scores. The system derives a set of features for each essay and learns to infer the human score from the features. A sample of essays, typically on the order of 300 to 500, is scored by human scorers and then used to train the automated system. Performance of the automated system is compared to the performance of two human raters and, if found acceptable, the automated system then scores the remaining essays.

In the six years since the most recent survey of the automated scoring field [19], developments in machine learning have influenced the modeling choices in both research and commercial systems. These techniques include hierarchical classification [14], correlated linear regression [17], various neural network deep learning approaches [7, 21] and many others. In addition, some commercial systems have described their modeling subsystems, e.g. [4].

As we move away from high stakes scoring with precisely trained models to formative scoring with instructor-trained models (and the use of automated scoring in the classroom), the burden of generating reliable scores to train the automated system falls on the instructor. For the automated system to score reliably, the instructor must score a sufficiently large number of essays to capture the variability of student responses, and do so in a sufficiently reliable fashion that the regularities of scoring are learnable by machine learning techniques. In the system we have developed, where the system learns from an instructor's scoring behavior, the instructor need only score enough essays to build a performant model. The hurdle is that, as the instructor scores, the system needs to give the instructor feedback on how well the current model is performing. In an intelligible manner, the system must update the instructor on its progress. The AI system must continually provide information that allows the instructor to make an informed decision about the quality of the automated scoring and to determine when it is justifiable to turn scoring over to the automated system.

3   SYSTEM DESCRIPTION
We have developed a prototype system which allows instructors to assign writing to their students and participate in AI-assisted grading workflows. The system allows an instructor to create an account, invite students to join a course, and assign writing within a course. The prototype reported on here is an intermediate step toward enabling instructors to write and have their own prompts automatically scored. This step allows us to test the user interface and methods for sharing the task between the instructor and the system. The system learns to modify the scoring of existing, already-modeled prompts to more closely represent an instructor's scoring. In the current iteration, instructors select from a list of available writing prompts, each of which contains a short description, a rubric against which to grade student submissions, and an existing automated grading model. Once the prompt has been assigned, students are able to draft and submit their responses.

The collection of student responses goes through an active learning preprocessing step to calculate a recommended ordering for the instructor to grade essays. Active learning is typically employed to reduce human annotation effort, and in our system we use it to minimize the number of human-graded submissions needed for reliable modeling. Within the instructor interface this is simply the list of submissions to grade, sorted by the active learning order. In our current implementation, we use the Kennard-Stone algorithm [10], which attempts to select submissions in a manner that uniformly covers the feature space by iteratively selecting the submission with the maximum minimum distance to all previously selected submissions. We use baseline automated scores as our feature space so that the human grader will see approximately the same number of submissions at each score point, despite very low and very high scoring submissions being rare. In other natural language active learning tasks, biasing the active learner in favor of low-frequency classes has been found to work well [8, 22], and Kennard-Stone has been found to perform well for automated grading in particular [6].
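The max-min selection step can be sketched in a few lines. The following is our paraphrase of Kennard-Stone [10], under the assumption that each submission is represented by a feature vector (in our case, its baseline automated scores); the seeding rule and all names are illustrative, not the production implementation.

import numpy as np

def kennard_stone_order(features, n_select):
    """Greedy Kennard-Stone ordering: each new pick maximizes its
    minimum distance to all previously selected items."""
    X = np.asarray(features, dtype=float)
    # Seed with the point farthest from the centroid (one common variant).
    start = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    order = [start]
    # Minimum distance from every point to the selected set so far.
    min_dist = np.linalg.norm(X - X[start], axis=1)
    while len(order) < n_select:
        pick = int(np.argmax(min_dist))  # maximum minimum distance
        order.append(pick)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[pick], axis=1))
    return order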
Figure 1: The instructor’s grading interface used in our pilot study. Upper Left: AI readiness notification message. Lower Left: Essay
prompt details link and student’s essay response. Right: Total score, AI training progress bar, and rubric scoring controls with score
of 4 for trait Organization.



As the instructor scores submissions, the system begins the modeling phase, in which the machine learning system is trained to mimic the instructor's grades and its performance is evaluated. Once the system determines the evaluation is acceptable, the instructors are signaled that training is complete, and they are able to view the automated grades and make adjustments as needed. If the instructor corrects a grade, the system refines the model using the newly graded submissions.

As a first step in understanding how instructors interact with AI systems, we decided not to allow instructors direct access to a highly complex, large-parameter-space machine learning model. Instead, instructors assigned prompts for which there already existed a fully trained machine learning model. Our implementation uses logistic regression to learn a model that modifies the pre-trained automated scores to better match the instructor's scoring. The system learned the two parameters of a logistic regression model to estimate the instructor's scores based on the instructor's scores on responses and the scores from the existing model. The logistic regression functions as a transformation over the pre-trained model scores, adjusting the distance between score points to more closely match the instructor's scoring behavior. By learning a transformation over the pre-trained model, we are able to leverage the accuracy of the existing model while allowing instructors to adjust the scoring to suit their classroom needs.
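One way to read this two-parameter setup is as a sigmoid (Platt-style) rescaling of the pre-trained engine's score toward the instructor's scale. The sketch below is our interpretation under that assumption; it fits the two parameters by least squares rather than by the paper's logistic regression, and the function names and 0-4 rubric range are illustrative.

import numpy as np
from scipy.optimize import minimize

def fit_calibration(engine_scores, instructor_scores, lo=0.0, hi=4.0):
    """Fit slope a and intercept b of a sigmoid that maps pre-trained
    engine scores onto an instructor's rubric range [lo, hi]."""
    x = np.asarray(engine_scores, dtype=float)
    y = np.asarray(instructor_scores, dtype=float)

    def loss(params):
        a, b = params
        pred = lo + (hi - lo) / (1.0 + np.exp(-(a * x + b)))
        return np.mean((pred - y) ** 2)

    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x  # (a, b)

def calibrated_score(engine_score, a, b, lo=0.0, hi=4.0):
    raw = lo + (hi - lo) / (1.0 + np.exp(-(a * engine_score + b)))
    return int(round(raw))  # snap to the nearest rubric score point

Because only two parameters are estimated, a handful of instructor-scored responses suffices to shift and stretch the engine's score scale without retraining the underlying model.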
4   PILOT STUDY
We conducted a pilot with nine instructors and teaching assistants for an Introductory Psychology course at a large university. The participants completed an initial training session where they were presented with an overview of the interface, and were encouraged to practice on a small set of student writing before the start of the pilot. Over the course of the semester, the participants were asked to use our system to grade 100 student submissions for each of nine writing prompts. Student submissions were sampled from the participating instructors' courses and were anonymized. Prompts were assigned in sets of three, and then made available to the participants to score.

Figure 2: Step 1a - Model training (apprenticeship modeling): UI indicates instructor must explicitly train the AI grading assistant. This is the initial messaging.

Figure 3: Step 1b - Model training (apprenticeship modeling): UI indicates current progress in the training phase.

Participants received prompts to grade in three phases, each consisting of three prompts. Upon logging in, the participants could begin grading by selecting any of the three available prompts. The prompt was presented on a summary screen (Figure 1), which also presented the participants with the suggested order in which to grade the submissions (generated by active learning).

In addition to this ordering, the submissions were further divided into two sets: one that had to be graded by the human rater, and one that could receive AI feedback. As participants graded the first set, a progress bar indicated how close the model was to being trained; they could refer to the summary screen to evaluate their progress at any time. Once the predetermined threshold was reached, the participants received a "Hooray" message, as in Figure 1, and they were then able to review the output of the automated scoring model, adjusted by the logistic regression. They would then review several autoscored responses and adjust the scores as needed. The regression model was recalculated after each further adjustment. Once the participants were satisfied with the performance of the model, they were able to finalize the scoring and fix the grades for the rest of the submissions to that prompt.

5   APPRENTICESHIP MODEL OF TRAINING
The system used in the pilot employed a progress bar with an alert message to communicate the current level of scoring performance to the participants. The interface and interaction implicitly encouraged instructors to regard the system as a tool and to think of system state as bimodal: untrained or trained. The disadvantage of this approach is that it encouraged instructors to infer that, after the transition from untrained to trained, the tool's performance matched their own, when technically it meant only that a fuzzy threshold had been passed and that further monitoring and feedback of automated scoring were still required.

This mismatch likely caused participants to be less vigilant, while survey results indicated participants felt disappointed when they had to correct the nascent automatic scoring. The message "Hooray! The scoring assistant is now calibrated . . . " and the green color of the progress bar implicitly set incorrect expectations and discouraged participants from carrying out further review and revision of scores, other than minimal tests to satisfy themselves that the model was performant. Additionally, during pre-pilot instruction, we suggested that the participants review approximately twenty submissions after the autoscoring model was enabled. Participants rarely strayed from this guideline, reviewing approximately twenty submissions on average. In post-surveys, participant responses indicated that they did not have a strong sense of when to stop reviewing; many would grade until the automated scores for a single essay matched their expectations. Analyses of behavioral data, survey results and users' feedback motivated us to reevaluate our user experience design to better scaffold the user through the process of training and to better communicate the expected quality of the automated scoring model.

While apprenticeship has been a model of human skill building for millennia, Lave was among the first to study and describe it as a formal mode of learning [12]. Collins et al. further generalized Lave's observation into what we refer to as an apprenticeship model of training [5]. This pedagogy-oriented paradigm consists of multiple phases, of which the first three are relevant for our application: modeling, coaching, and fading. In modeling, the apprentice (learner) "repeatedly observes the expert performing the target process". During coaching, the apprentice "attempts to execute the process while the expert provides guidance and scaffolds feedback and instruction". Lastly, in fading, the expert provides less feedback and eventually ascertains the apprentice's mastery.

This paradigm has been employed in computer supported collaborative learning (CSCL) and intelligent tutoring systems (ITS), where the system regards the user as an apprentice to help them develop new skills [3, 11, 13].

The apprenticeship model provides a useful metaphor for aligning our system's three stages of data gathering and application with an accessible, real-world process. Our system swaps the human-computer relationship typical of ITS and instead considers the user the expert and the AI-assistant persona the apprentice. By adopting this framework of model training as task modeling, model tuning and validation as coaching, and model acceptance as fading, we help the user to better understand the expected interactions and responsibilities.
Our coupling of apprenticeship with machine learning is distinct from the use of apprenticeship in reinforcement learning [1], which does not have an interactive human element.

6   REDESIGNED USER INTERFACE AND FLOW
Based on the feedback from our initial pilot, our new three-phase apprenticeship approach encourages instructors to view the process of training the automated scoring system as an apprenticeship. In this view, they can reasonably expect the AI-assistant to continue to learn even after it starts grading. The instructor now expects mistakes to continue, but in diminishing number and severity over time. With this approach, minor mistakes are less likely to damage trust in the system.

Figure 4: Step 2a - Model tuning/validation (apprenticeship coaching): Messaging informs instructor the AI grading assistant is ready to begin scoring essays.

Figure 5: Step 2b - Model tuning/validation (apprenticeship coaching): Evaluation metrics inform the instructor of the AI grading assistant's performance and encourage the instructor to continue grading.

Figure 6: Step 2c - Model tuning/validation (apprenticeship coaching): Messaging informs instructor the AI grading assistant might be able to take over scoring essays.

In the redesigned UI, a large circle at the top of each screen indicates the current location in the process, with messages updating to keep the instructor informed of progress within a given phase. At the beginning of the apprentice modeling phase (Figure 2), instructors are encouraged to score essays to help train the AI grading assistant. As they score more essays they see progress updates, as shown in Figure 3. When the AI grading assistant achieves the ability to begin scoring, instructors move into the coaching phase (Figure 4).

In this phase the instructor scores and then compares their score to the AI-assistant's (Figure 5). This phase ends when the system gains sufficient confidence in the model's performance (e.g., correlations of 0.7 to 0.8, values similar to human inter-rater reliability). This is reflected in the instructor's view by showing the instructor's corrections diminishing (Figure 6). Once in the fading phase (Figure 7), the instructor passes scoring to the AI-assistant, but still retains the ability to review the AI-assistant's scores. Messages reinforce the relationship between additional scoring and performance, making the apprentice relationship of the assistant (e.g., ". . . the more you grade, the more the assistant learns from you") more transparent. The level of the assistant's learning is indicated by the number and percentage of agreements compared to disagreements with the instructor's grades, and by the progress bar. The progress indicated by the bar follows the number of essays scored, but can accelerate as model performance improves.
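A status readout of this kind could be assembled as below; this is a hypothetical sketch of the idea, not the pilot's code, and the thresholds (30 essays, 80% agreement) are placeholder values. The exponent trick lets strong agreement pull the bar ahead of the raw essay count, matching the accelerating behavior described above.

def coaching_status(instructor_scores, assistant_scores,
                    target_n=30, target_agreement=0.80):
    """Agreement readout plus a progress value that tracks essays
    scored but accelerates as agreement with the instructor rises."""
    n = len(instructor_scores)
    agree = sum(h == m for h, m in zip(instructor_scores, assistant_scores))
    rate = agree / n if n else 0.0
    count_progress = min(n / target_n, 1.0)
    boost = min(rate / target_agreement, 1.0)
    progress = count_progress ** (1.0 - 0.5 * boost)  # exponent < 1 speeds the bar
    message = f"{agree} of {n} scores matched yours ({rate:.0%} agreement)"
    return progress, message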
Figure 7: Step 3 - Model acceptance (apprenticeship fading): Instructor has handed over control to the AI grading assistant and is encouraged to review scores on remaining essays.

7   CONCLUSIONS AND FUTURE WORK
As more people interact with systems based on sophisticated, often opaque algorithms, it becomes ever more critical to develop common languages and appropriate metaphors that allow communication and common understanding. Often, as these systems move into increasingly common use, a more refined understanding of how the machine learning component is trained and what its limitations are becomes lost. In our first pilot we adopted a quite reasonable view of training an automated scoring system as a tool. Our first set of instructors internalized this model, with unexpected consequences both for their performance on the task and their satisfaction with completing it. In moving to the apprentice model, we believe we have found a metaphor that ameliorates some of these issues.

Our next steps include conducting pilots with this new metaphor and a UI/UX that supports it. We have begun to think more broadly about the complex relationships between clever systems and equally clever people, both of which have large blind spots. The instructors know the domain area but may have less experience in the type of reliable scoring required to train an automated scoring model. The AI system embodies knowledge about scoring that can be used to scaffold the instructor's scoring, but at the same time is an apprentice to how the instructor wants the prompt evaluated. How to share the task and how the two agents should communicate are interesting, open questions. These questions will become even more relevant as we begin testing the complete system, which will include instructors authoring prompts and will replace logistic regression with a full modeling pipeline.

REFERENCES
 [1] P. Abbeel and A. Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning.
 [2] Gavin T. L. Brown, Kath Glasswell, and Don Harland. 2004. Accuracy in the scoring of writing: Studies of reliability and validity using a New Zealand writing assessment system. Assessing Writing 9, 2 (2004), 105–121.
 [3] John Seely Brown, R. R. Burton, and A. G. Bell. 1975. SOPHIE: A step toward creating a reactive learning environment. International Journal of Man-Machine Studies 7, 5 (Sept. 1975), 675–696.
 [4] Jing Chen, James H. Fife, Isaac I. Bejar, and André A. Rupp. 2016. Building e-rater® scoring models using machine learning methods. ETS Research Report Series 2016, 1 (2016), 1–12.
 [5] A. Collins, J. S. Brown, and S. E. Newman. 1987. Cognitive apprenticeship: Teaching the craft of reading, writing and mathematics. Technical Report 403. Centre for the Study of Reading, University of Illinois; BBN Laboratories, Cambridge, MA.
 [6] Nicholas Dronen, Peter W. Foltz, and Kyle Habermehl. 2015. Effective sampling for large-scale automated writing evaluation systems. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale, 3–10. https://doi.org/10.1145/2724660.2724661
 [7] Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural automated essay scoring and coherence modeling for adversarially crafted input. arXiv preprint arXiv:1804.06898 (2018).
 [8] Andrea Horbach and Alexis Palmer. 2016. Investigating active learning for short-answer scoring. In BEA@NAACL-HLT, 301–311.
 [9] Anders Jonsson and Gunilla Svingby. 2007. The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review 2, 2 (2007), 130–144.
[10] Ronald W. Kennard and Larry A. Stone. 1969. Computer aided design of experiments. Technometrics 11, 1 (1969), 137–148.
[11] S. P. Lajoie and A. M. Lesgold. 1992. Dynamic assessment of proficiency for solving procedural knowledge tasks. Educational Psychologist 27 (1992), 365–384.
[12] J. Lave. [n. d.]. Tailored learning: Education and everyday practice among craftsmen in West Africa. Technical Report.
[13] A. Lesgold, G. Eggan, and G. Rao. 1992. Possibilities for assessment using computer-based apprenticeship environments. In W. Regian and V. Shute (Eds.), Cognitive Approaches to Automated Instruction (1992), 49–80.
[14] Danielle S. McNamara, Scott A. Crossley, Rod D. Roscoe, Laura K. Allen, and Jianmin Dai. 2015. A hierarchical classification approach to automated essay scoring. Assessing Writing 23 (2015), 35–59.
[15] Miles Myers. 1980. A procedure for writing assessment and holistic scoring. National Council of Teachers of English, Urbana, IL.
[16] Ellis B. Page. 1967. Statistical and linguistic strategies in the computer grading of essays. In Coling 1967: Conférence Internationale sur le Traitement Automatique des Langues, Grenoble, France.
[17] Peter Phandi, Kian Ming A. Chai, and Hwee Tou Ng. 2015. Flexible domain adaptation for automated essay scoring using correlated linear regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
[18] Mark D. Shermis and Jill C. Burstein (Eds.). 2003. Automated Essay Scoring: A Cross-disciplinary Perspective. Lawrence Erlbaum Associates, Mahwah, NJ.
[19] Mark D. Shermis and Jill C. Burstein (Eds.). 2013. Handbook of Automated Essay Evaluation: Current Applications and New Directions. Routledge, New York.
[20] Dannelle D. Stevens and Antonia J. Levi. 2013. Introduction to Rubrics: An Assessment Tool to Save Grading Time, Convey Effective Feedback, and Promote Student Learning. Stylus Publishing, LLC.
[21] Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
[22] Katrin Tomanek and Udo Hahn. 2009. Reducing class imbalance during active learning for named entity annotation. In Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP '09). ACM, New York, NY, USA, 105–112. https://doi.org/10.1145/1597735.1597754
[23] Edward W. Wolfe. 2004. Identifying rater effects using latent trait models. Psychology Science 46 (2004), 35–51.