<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IUI Workshops’19, March 20, 2019, Los Angeles, USA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An apprenticeship model for human and AI collaborative essay grading</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alok Baikadi</string-name>
          <email>alok.baikadi@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Foltz</string-name>
          <email>peter.foltz@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lee Becker</string-name>
          <email>leek.becker@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Gorman</string-name>
          <email>andrew.gorman@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jill Budden</string-name>
          <email>jill.budden@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Scott Hellman</string-name>
          <email>scott.hellman@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Murray</string-name>
          <email>william.murray@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Rosenstein</string-name>
          <email>mark.rosenstein@pearson.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pearson</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>20</volume>
      <issue>2019</issue>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CCS CONCEPTS
• Human-centered computing → Human computer
interaction (HCI).
1 INTRODUCTION
Across a range of domains, humans and complex algorithms
embedded in systems are collaborating and coordinating
action. Tasks are shared because the combined system can be
more capable than either agent acting alone. Such systems
with shared autonomy raise important research questions in
how to design these joint systems to best utilize each of the
actors’ capabilities for optimal performance while addressing
important safety, legal and ethical issues.</p>
      <p>Our work investigates students developing their writing
ability throughout their educational career, since writing
proficiency is a critical life and career competency. Writing is a
skill that develops through practice and feedback. However,
the massive effort required of instructors in providing
feedback on drafts and scoring final versions is a limiting factor in
assigning essays and short answer problems in their classes.
IUI Workshops’19, March 20, 2019, Los Angeles, USA
Copyright © 2019 for the individual papers by the papers’ authors.
Copying permitted for private and academic purposes. This volume is
published and copyrighted by its editors.</p>
      <p>
        Over the last 20 years, automated scoring of student
writing through the use of natural language processing, machine
learning and artificial intelligence techniques coupled with
human-scored training sets has in many applications achieved
performance comparable to that of humans [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In high
stakes testing, millions of students have had their writing
automatically scored, with the prompts, rubrics and human
scoring for training and validating model performance
implemented in tightly controlled frameworks. Our work addresses
the question of how to move this technology to a wider
audience and add formative assessment to have a broader
impact in helping students learn to write.
      </p>
      <p>Our goal is to lower the barriers that limit an instructor’s
ability to assign essays in their courses. Our approach is
to develop a system in which instructors develop prompts
appropriate for their course and, by scoring a subset of their
student responses, are able to turn on an automated system
to score the rest of the responses. These prompts and the
ability to automatically score them become a resource that
instructors can reuse and even share. The critical issue is
that while the instructor is an expert in their domain, they
are likely not an expert in either assessment or the machine
learning techniques that make automated scoring possible.</p>
      <p>We have piloted a prototype system in a large introductory
psychology course at a large university. This pilot explored
the issues of 1) transferring scoring expertise from an
instructor to an automated system and 2) using an automated system
to provide feedback to the instructor about the quality of the
current state of its scoring. In most end user applications of AI,
the user is only exposed to decisions or behavior from some
unseen, unknown model, and training of the machine learning
mechanism is either hidden or taken for granted by the user.
Exposing machine learning flows to users unversed in the
notions of model performance and evaluation raises interesting
design questions around user trust, system transparency, and
managing expectations.</p>
      <p>We discuss the approach used for the pilot and some of the
issues that emerged. We then show how adopting a metaphor
of apprenticeship clarifies the communication between the
instructor and the AI assistant. In this paper we discuss the
issues of shared autonomy that arise in such a system and
the issues we have seen, in defining the task, in making
results mutually interpretable to the instructor and the machine,
and in designing user interfaces that make more transparent
all the various factors that are required for making correct
decisions on how the task should proceed.
</p>
      <p>
        2 RELIABLE SCORING
Current high stakes scoring implementations attempt to achieve
reliability through various means: the use of rubrics to specify
the results that must be evidenced at each score point [
        <xref ref-type="bibr" rid="ref20 ref9">9, 20</xref>
        ];
anchor papers, which are actual example essays selected to
indicate typical and boundary score point answers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; and
supervised scorer training, which often includes practice
scoring exercises and required levels of performance on example
essays. Yet despite these rigorous preparations, raters still
disagree. Psychometricians have developed methods to detect
subtle biases in raters referred to as rater effects [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. These
techniques can allow performance monitoring during scoring
and detection after scoring.
      </p>
      <p>
        In complex tasks, such as writing an essay, even with
well-trained scorers without rater effects, there will be an expected
level of disagreement over the score of individual essays. A
number of studies have found that in well-designed prompts
and rubrics with well-trained scorers, the expected range of
adjacent agreement, i.e., scores within 1 point, is 80-99%, with
correlations in the 0.7 to 0.8 range (Brown et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provide a
summary of research and standards in this area).
      </p>
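For illustration (the scores below are invented, not data from the cited studies), adjacent agreement and rater correlation between two score vectors can be computed as:

```python
# Illustrative sketch, not part of the pilot system: adjacent agreement
# (scores within a tolerance of 1 point) and Pearson correlation
# between two raters' scores on the same essays.
def adjacent_agreement(scores_a, scores_b, tolerance=1):
    """Fraction of essays whose two scores differ by at most `tolerance`."""
    hits = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rater1 = [3, 4, 2, 5, 3, 4, 1, 2]
rater2 = [3, 5, 2, 4, 2, 4, 2, 4]
print(adjacent_agreement(rater1, rater2))  # 0.875 — one pair differs by 2
print(pearson_r(rater1, rater2))
```

With these invented scores, agreement falls inside the 80-99% band while the correlation (about 0.66) sits just below the 0.7-0.8 range cited above.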
      <p>
        Human scoring is time consuming, expensive and limits
the immediacy of feedback that can be provided to the student.
Page described the first system to automatically score essays
based on analysis of a fairly limited set of features from the
essay [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Present day automated systems score millions of
student essays in both high stakes and formative engagements
with performance levels at or above human scorers
(Shermis and Burstein have co-edited comprehensive summaries
on the subject [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]). These automated scoring systems
are typically based on supervised machine learning, where
the system is trained on a set of student essays and human
scores. The system derives a set of features for each essay and
learns to infer the human score from the features. A sample
of essays, typically on the order of 300 to 500, is scored by
human scorers, and then used to train the automated system.
Performance of the automated system is compared to the
performance of two human raters and, if found acceptable, the
automated system then scores the remaining essays.
      </p>
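The train-then-hand-off workflow above can be sketched as follows; the word-count feature and nearest-mean learner are deliberately trivial stand-ins of our own invention, not the actual NLP features or model of any production system:

```python
# Hedged sketch of the supervised scoring workflow: train on a
# human-scored sample, then score the remaining essays automatically.

def extract_features(essay):
    # Stand-in feature: word count (real systems derive many NLP features).
    return len(essay.split())

def train(essays, human_scores):
    # Toy learner: record the mean feature value per score point.
    by_score = {}
    for essay, score in zip(essays, human_scores):
        by_score.setdefault(score, []).append(extract_features(essay))
    return {s: sum(v) / len(v) for s, v in by_score.items()}

def predict(model, essay):
    # Assign the score point whose mean feature is closest.
    f = extract_features(essay)
    return min(model, key=lambda s: abs(model[s] - f))

# In practice the human-scored training sample is on the order of
# 300 to 500 essays; three suffice for this toy illustration.
training = [("short answer", 1),
            ("a somewhat longer developed answer", 2),
            ("a much longer fully developed answer with extra detail words", 3)]
model = train([e for e, _ in training], [s for _, s in training])
print(predict(model, "another quite brief response"))  # → 2
```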
      <p>
        In the six years since the most current survey of the
automated scoring field [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], developments in machine learning
have impacted the modeling choices in both research and in
commercial systems. These techniques include hierarchical
classification [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], correlated linear regression [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and
various neural-net deep learning approaches [
        <xref ref-type="bibr" rid="ref21 ref7">7, 21</xref>
        ] and many
others. In addition, some commercial systems have described
their modeling subsystems, e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>As we move away from high stakes scoring with precisely
trained models to formative scoring with instructor-trained
models (and the use of automated scoring in the classroom),
the burden of generating reliable scores to train the automated
system now falls on the instructor. For the automated
system to reliably score, the instructor must score a sufficiently
large number of essays to capture the variability of student
responses and do so in a sufficiently reliable fashion to allow
the regularities of scoring to be learnable by machine
learning techniques. In the system we have developed, where the
system learns from an instructor’s scoring behavior, the
instructor need only score enough essays to build a performant
model. The hurdle is that, as the instructor scores, the system
needs to provide feedback on how well the current model is performing. In an
intelligible manner, the system must update the instructor on
its progress. The AI system must continually provide
information to the instructor to allow the instructor to make an
informed decision about the quality of the automated scoring
and determine when it is justifiable to turn scoring over to the
automated system.
</p>
      <p>3 SYSTEM DESCRIPTION
We have developed a prototype system which allows
instructors to assign writing to their students and participate in
AI-assisted grading workflows. The system allows an instructor
to create an account, invite students to join a course, and
assign writing within a course. The prototype reported on here
is an intermediate step toward enabling instructors to write
and have their own prompts automatically scored. This step
allows us to test the user interface and methods for sharing
the task between the instructor and the system. This system
learns to modify the scoring of existing, already-modeled
prompts to more closely represent an instructor’s scoring. In
this current iteration, instructors select from a list of available
writing prompts, each of which contains a short description,
a rubric against which to grade student submissions, and a
currently existing automated grading model. Once the prompt
has been assigned, students are able to draft and submit their
responses.</p>
      <p>
        The collection of student responses goes through an active
learning preprocessing step to calculate a recommended
ordering for the instructor to grade essays. Active learning is
typically employed to reduce human annotation effort, and
in our system we use it to minimize the number of
human-graded submissions needed for reliable modeling. Within
the instructor interface this is simply the list of submissions
to grade sorted by the active learning order. In our current
implementation, we use the Kennard-Stone algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Kennard-Stone attempts to select submissions in a manner
that uniformly covers the feature space by iteratively selecting
the submission with the maximum minimum distance to all
previously selected submissions. We use baseline automated
scores as our feature space so that the human grader will see
approximately the same number of submissions at each score
point, despite very low and very high scoring submissions
being rare. In other natural language active learning tasks,
biasing the active learner in favor of low-frequency classes has
been found to work well [
        <xref ref-type="bibr" rid="ref22 ref8">8, 22</xref>
        ], and Kennard-Stone has been
found to perform well for automated grading in particular [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
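A sketch of Kennard-Stone max-min selection over one-dimensional baseline-score features, as described above; the seeding with the two most distant points follows the standard algorithm, and the scores are invented:

```python
# Kennard-Stone ordering: iteratively pick the submission whose minimum
# distance to all previously selected submissions is largest, so that
# selections spread uniformly over the feature space.

def kennard_stone_order(features):
    """Return indices of `features` (a list of floats) in selection order."""
    n = len(features)
    # Seed with the two mutually most distant points.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: abs(features[p[0]] - features[p[1]]))
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in selected]
    while remaining:
        # Max-min criterion: the point farthest from the selected set.
        nxt = max(remaining,
                  key=lambda k: min(abs(features[k] - features[s])
                                    for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Baseline automated scores as the 1-D feature space (invented values).
baseline_scores = [2.1, 2.2, 3.9, 1.0, 3.0, 2.0]
print(kennard_stone_order(baseline_scores))  # → [2, 3, 1, 4, 5, 0]
```

Note how the rare extreme scores (indices 2 and 3) are surfaced first, which is why this ordering shows the grader roughly equal numbers of submissions at each score point.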
      <p>As the instructor scores submissions, the system begins the
modeling phase. In the modeling phase, the machine learning
system is trained to mimic the instructor’s grades, and its
performance is evaluated. Once the system determines the
evaluation is acceptable, the instructors are signaled that the
training is complete, and they are able to view the automated
grades and make adjustments as needed. If the instructor
corrects the grade, the system refines the model using the
newly graded submissions.</p>
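In outline, the train-evaluate-signal loop just described might look as follows; the function names, the toy majority-score trainer, and the exact-agreement threshold are all illustrative assumptions, not the pilot's implementation:

```python
# Assumed control flow for the modeling phase: train a model on the
# instructor's grades, evaluate its agreement with them, and signal
# the instructor only once a preset threshold is cleared.
from collections import Counter

def agreement(model_scores, instructor_scores):
    hits = sum(m == i for m, i in zip(model_scores, instructor_scores))
    return hits / len(instructor_scores)

def modeling_phase(graded, train_fn, threshold=0.7):
    """`graded` is a list of (submission, instructor_score) pairs."""
    model = train_fn(graded)
    preds = [model(sub) for sub, _ in graded]
    ready = agreement(preds, [s for _, s in graded]) >= threshold
    return model, ready  # signal the instructor only when `ready` is True

# Toy trainer: always predict the instructor's most common score.
def majority_trainer(graded):
    common = Counter(s for _, s in graded).most_common(1)[0][0]
    return lambda submission: common

graded = [("essay A", 3), ("essay B", 3), ("essay C", 2)]
model, ready = modeling_phase(graded, majority_trainer, threshold=0.6)
```

When the instructor later corrects a grade, the corrected pair is appended to `graded` and `modeling_phase` is simply rerun, mirroring the refinement step described above.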
      <p>As a first step to understand how instructors interact with
AI systems, we decided to not allow instructors direct access
to a highly complex, large-parameter-space machine learning
model. Instead, instructors assigned prompts for which there
already existed a fully trained machine learning model. Our
implementation uses logistic regression to learn a model that
modifies the pre-trained automated scores to better match the
instructor’s scoring. The system learned the two parameters of
a logistic regression model that estimates the instructor’s scores
from the responses the instructor has already scored and the scores
from the existing model. The logistic regression functions as a
transformation over the pre-trained model scores by adjusting
the distance between score points to more closely match the
instructor’s scoring behavior. By learning a transformation
over the pre-trained model, we are able to leverage the
accuracy of the existing model, while allowing instructors to
adjust the scoring to suit their classroom needs.</p>
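A minimal sketch of one way such a two-parameter adjustment could be implemented; the parameterization (a sigmoid with learned slope and intercept, stretched over the rubric range) and the gradient-descent fit are our assumptions, since the text specifies only that two parameters are learned:

```python
import math

# Hedged sketch: a two-parameter logistic transformation that rescales
# pre-trained model scores toward an instructor's scoring behavior.

def transform(x, a, b, lo=1, hi=6):
    """Map a pre-trained score x onto the rubric range [lo, hi]."""
    return lo + (hi - lo) / (1 + math.exp(-(a * x + b)))

def fit(pretrained, instructor, lo=1, hi=6, lr=0.01, steps=5000):
    """Fit slope `a` and intercept `b` by gradient descent on squared error."""
    a, b = 1.0, 0.0
    n = len(pretrained)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(pretrained, instructor):
            s = 1 / (1 + math.exp(-(a * x + b)))
            err = (lo + (hi - lo) * s) - y
            g = err * (hi - lo) * s * (1 - s)  # chain rule through the sigmoid
            ga += g * x
            gb += g
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def mse(params, pretrained, instructor):
    a, b = params
    return sum((transform(x, a, b) - y) ** 2
               for x, y in zip(pretrained, instructor)) / len(pretrained)

# Invented example: this instructor grades roughly one point harder
# than the pre-trained model on most essays.
pretrained = [2, 3, 4, 5, 6, 3, 4]
instructor = [1, 2, 3, 4, 6, 2, 4]
a, b = fit(pretrained, instructor)
```

Because only two parameters are fitted, a handful of instructor-scored responses is enough to shift the score scale, while the rank ordering learned by the pre-trained model is preserved.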
      <p>4 PILOT STUDY
We conducted a pilot with nine instructors and teaching
assistants for an Introductory Psychology course at a large
university. The participants completed an initial training session
where they were presented with an overview of the interface,
and were encouraged to practice on a small set of student
writing before the start of the pilot. Over the course of the
semester, the participants were asked to use our system to grade 100
student submissions for each of nine writing prompts. Student
submissions were sampled from the participating instructors’
courses and were anonymized. Prompts were assigned in sets
of three, and then made available to the participants to score.</p>
      <p>Participants received prompts to grade in three phases, each
consisting of three prompts. Upon logging in, the participants
were able to begin grading by selecting any of the three
available prompts. The prompt was presented
on a summary screen (Figure 1). The summary screen also
presents the participants with the suggested order in which to
grade the submissions (generated by active learning).</p>
      <p>In addition to this ordering, the submissions were further
divided into two sets: One that must be graded by the human
rater, and one that could have AI feedback. As participants
graded the first set, a progress bar indicated how close the
model was to being trained. They could refer to the summary
screen to evaluate their progress at any time. Once the
predetermined threshold was reached, the participants received a
“Hooray” message, as in Figure 1, and they were then able to
review the output of the automated scoring model, adjusted
by the logistic regression. They would then review several
autoscored responses, and adjust the scores as needed. The
regression model would be recalculated after each further
adjustment. Once the participants were satisfied with the
performance of the model, they were able to finalize the scoring
and fix the grades for the rest of the submissions to that
prompt.
</p>
      <p>5 APPRENTICESHIP MODEL OF TRAINING
The system used in the pilot employed a progress bar with
an alert message to communicate the current level of scoring
performance to the participants. The interface and interaction
implicitly encouraged instructors to regard the system as a
tool and to think of system state as bimodal – untrained or
trained. The disadvantage of this approach is that it
encouraged instructors to infer that, after the transition from untrained
to trained, the tool’s performance matched their own; in fact, it
meant only that a fuzzy threshold had been passed and that
further monitoring and feedback on automated scoring were
still required.</p>
      <p>This mismatch likely caused participants to be less vigilant,
while survey results indicated participants felt disappointed
when they had to correct the nascent automatic scoring. The
message "Hooray! The scoring assistant is now calibrated
. . . " and the green color of the progress bar implicitly set
incorrect expectations and discouraged participants from
carrying out further review and revision of scores, other than
minimal tests to satisfy themselves that the model was
performant. Additionally, during pre-pilot instruction, we suggested
that the participants review approximately twenty
submissions after the autoscoring model was enabled. Participants
rarely strayed from these guidelines, reviewing approximately
twenty submissions on average. In post-surveys, participant
responses indicated that they did not have a strong sense of
when to stop reviewing. Many would grade until the
automated scores for a single essay matched their expectations.
Analyses of behavioral data, survey results and users’
feedback motivated us to reevaluate our user experience design
to better scaffold the user through the process of training and
to better communicate the expected quality of the automated
scoring model.</p>
      <p>
        While apprenticeship has been a model of human skill
building for millennia, Lave was among the first to study and
describe it as a formal mode of learning [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Collins et al.
further generalized Lave’s observation into what we refer to
as an apprenticeship model of training [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This
pedagogy-oriented paradigm consists of multiple phases, of which the first
three are relevant for our application: modeling, coaching,
and fading. In modeling, the apprentice (learner) “repeatedly
observes the expert performing the target process”. During
coaching, the apprentice “attempts to execute the process
while the expert provides guidance and scaffolds feedback
and instruction”. Lastly, in fading the expert provides less
feedback and eventually ascertains the apprentice’s mastery.
      </p>
      <p>
        This paradigm has been employed for computer supported
collaborative learning (CSCL) and intelligent tutoring
systems (ITS) where the system regards the user as an apprentice
to help them develop new skills [
        <xref ref-type="bibr" rid="ref11 ref13 ref3">3, 11, 13</xref>
        ].
      </p>
      <p>The apprenticeship model provides a useful metaphor for
aligning our system’s three stages of data gathering and
application with an accessible, real-world process. Our system
swaps the human-computer relationship typical of ITS and
instead considers the user the expert and the AI-assistant
persona the apprentice. By adopting this framework of model
training as task modeling, model tuning and validation as
coaching, and model acceptance as fading, we help the user
to better understand the expected interactions and
responsibilities.</p>
      <p>
        Our coupling of apprenticeship with machine learning is
distinct from the use of apprenticeship in reinforcement
learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which does not have an interactive human element.
      </p>
      <p>6 REDESIGNED USER INTERFACE AND FLOW
Based on the feedback from our initial pilot, in our new
three-phase apprenticeship approach, we encourage instructors to
view the process of training the automated scoring system as
an apprenticeship. In this view, they can reasonably expect the
AI-assistant to continue to learn even after it starts grading.
The instructor now expects mistakes to continue, but in
diminishing number and severity over time. With this approach
minor mistakes are less likely to damage trust in the system.</p>
      <p>In the redesigned UI, at the top of each screen a large circle
indicates the current location in the process, with messages
updating to keep the instructor informed of progress within
a given phase. At the beginning of the apprentice modeling
phase (Figure 2), instructors are encouraged to score essays to
help train the AI grading assistant. As they score more essays
they see progress updates as shown in Figure 3. When the AI
grading assistant first becomes able to begin scoring,
instructors move into the coaching phase (Figure 4).</p>
      <p>In this phase the instructor scores and then compares their
score to the AI-assistant (Figure 5). This phase ends when
the system gains sufficient confidence in the model’s
performance (e.g., 0.7 to 0.8, values similar to human inter-rater
reliability). This is reflected in the instructor’s view as the
instructor’s corrections diminish (Figure 6). Once
in the fading phase (Figure 7), the instructor passes
scoring to the AI-assistant, but still retains the ability to review
the AI-assistant’s scores. Messages reinforce the
relationship between additional scoring and performance, making
the apprentice-relationship of the assistant (e.g., ". . . the more
you grade, the more the assistant learns from you") more
transparent. The level of the assistant’s learning is indicated
by the number and percentage of agreements compared to
disagreements with the instructor’s grades, and by the progress
bar. The progress indicated by the bar follows the number
of essays scored, but can accelerate as model performance
improves.
</p>
      <p>7 CONCLUSIONS AND FUTURE WORK
As more people interact with systems based on sophisticated,
often opaque algorithms, it becomes ever more critical to
develop common languages and appropriate metaphors to
allow communication and common understanding. Often, as
these systems move to increasingly common use, a more
refined understanding of how the machine learning component
is trained and what its limitations are becomes lost. In our
first pilot we adopted a quite reasonable view of training an
automated scoring system as a tool. Our first set of instructors
internalized this model with unexpected consequences both
for their performance on the task and their satisfaction with
completing the task. In moving to the apprentice model, we
believe we have found a metaphor that ameliorates some of
these issues.</p>
      <p>Our next steps include conducting pilots with this new
metaphor and a UI/UX that supports it. We have begun to
think more broadly about the complex relationships between
clever systems and equally clever people, both of which have
large blind spots. The instructors know the domain area but
may have less experience in the type of reliable scoring
required to train an automated scoring model. The AI system
embodies knowledge about scoring that can be used to
scaffold the instructor’s scoring, but at the same time is an
apprentice to how the instructor wants the prompt evaluated. How to
share the task and how the two agents should communicate
are interesting, open questions. These questions will become
even more relevant as we begin testing the complete
system, which will include instructors authoring prompts and
replace the logistic regression with a full modeling pipeline.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Apprenticeship Learning via Inverse Reinforcement Learning</article-title>
          .
          <source>In Proceedings of the 21st International Conference on Machine Learning.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Gavin T. L.</given-names>
            <surname>Brown</surname>
          </string-name>
          , Kath Glasswell, and Don Harland. [n. d.].
          <article-title>Accuracy in the scoring of writing: Studies of reliability and validity using a New Zealand writing assessment system</article-title>
          .
          <source>Assessing writing 9</source>
          ,
          <issue>2</issue>
          ([n. d.]),
          <fpage>105</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>John Seely</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Burton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Bell</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>SOPHIE: A Step toward creating a reactive learning environment</article-title>
          .
          <source>International Journal of Man-Machine Studies 7</source>
          ,
          <issue>5</issue>
          (Sept.
          <year>1975</year>
          ),
          <fpage>675</fpage>
          -
          <lpage>696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jing</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Fife</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Isaac I.</given-names>
            <surname>Bejar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>André A.</given-names>
            <surname>Rupp</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Building e-rater® Scoring Models Using Machine Learning Methods</article-title>
          .
          <source>ETS Research Report Series</source>
          <year>2016</year>
          .
          <volume>1</volume>
          (
          <issue>2016</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Brown</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Newman</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>Cognitive apprenticeship: Teaching the craft of reading, writing and mathematics</article-title>
          .
          <source>Technical 403. Centre for the Study of Reading</source>
          , University of Illinois, BBN Laboratories, Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Nicholas</given-names>
            <surname>Dronen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter W.</given-names>
            <surname>Foltz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kyle</given-names>
            <surname>Habermehl</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Effective Sampling for Large-scale Automated Writing Evaluation Systems</article-title>
          .
          <source>Proceedings of the Second ACM Conference on Learning @ Scale</source>
          (
          <year>2015</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          . https://doi.org/10.1145/2724660.2724661
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Youmna</given-names>
            <surname>Farag</surname>
          </string-name>
          , Helen Yannakoudakis, and
          <string-name>
            <given-names>Ted</given-names>
            <surname>Briscoe</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Neural automated essay scoring and coherence modeling for adversarially crafted input</article-title>
          . arXiv preprint arXiv:1804.06898 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Horbach</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Investigating Active Learning for Short-Answer Scoring</article-title>
          .
          <source>In BEA@ NAACL-HLT</source>
          .
          <fpage>301</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Anders</given-names>
            <surname>Jonsson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gunilla</given-names>
            <surname>Svingby</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The use of scoring rubrics: Reliability, validity and educational consequences</article-title>
          .
          <source>Educational Research Review</source>
          <volume>2</volume>
          ,
          <issue>2</issue>
          (
          <year>2007</year>
          ),
          <fpage>130</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Ronald W.</given-names>
            <surname>Kennard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Larry A.</given-names>
            <surname>Stone</surname>
          </string-name>
          .
          <year>1969</year>
          .
          <article-title>Computer Aided Design of Experiments</article-title>
          .
          <source>Technometrics: A Journal of Statistics for the Physical, Chemical, and Engineering Sciences</source>
          <volume>11</volume>
          ,
          <issue>1</issue>
          (
          <year>1969</year>
          ),
          <fpage>137</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Lajoie</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Lesgold</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Dynamic assessment of proficiency for solving procedural knowledge tasks</article-title>
          .
          <source>Educational Psychologist</source>
          <volume>27</volume>
          (
          <year>1992</year>
          ),
          <fpage>365</fpage>
          -
          <lpage>384</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lave</surname>
          </string-name>
          . [n. d.].
          <article-title>Tailored learning: Education and everyday practice among craftsmen in West Africa</article-title>
          .
          <source>Technical Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lesgold</surname>
          </string-name>
          , G. Eggan, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Rao</surname>
          </string-name>
          .
          <year>1992</year>
          .
          <article-title>Possibilities for assessment using computer-based apprenticeship environments</article-title>
          .
          <source>Cognitive Approaches to Automated Instruction</source>
          , W. Regian &amp; V. Shute (Eds.) (
          <year>1992</year>
          ),
          <fpage>49</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Danielle S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rod D.</given-names>
            <surname>Roscoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Laura K.</given-names>
            <surname>Allen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianmin</given-names>
            <surname>Dai</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A hierarchical classification approach to automated essay scoring</article-title>
          .
          <source>Assessing Writing</source>
          <volume>23</volume>
          (
          <year>2015</year>
          ),
          <fpage>35</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Miles</given-names>
            <surname>Myers</surname>
          </string-name>
          .
          <year>1980</year>
          .
          <article-title>A procedure for writing assessment and holistic scoring</article-title>
          .
          <source>National Council of Teachers of English</source>
          , Urbana, IL.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Ellis B.</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <year>1967</year>
          .
          <article-title>Statistical and linguistic strategies in the computer grading of essays</article-title>
          .
          <source>Coling 1967: Conférence Internationale sur le Traitement Automatique des Langues</source>
          , Grenoble, France (
          <year>1967</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Phandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kian Ming A.</given-names>
            <surname>Chai</surname>
          </string-name>
          , and Hwee Tou Ng.
          <year>2015</year>
          .
          <article-title>Flexible domain adaptation for automated essay scoring using correlated linear regression</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Mark D.</given-names>
            <surname>Shermis</surname>
          </string-name>
          and Jill C. Burstein (Eds.).
          <year>2003</year>
          .
          <article-title>Automated essay scoring: A cross-disciplinary perspective</article-title>
          . Lawrence Erlbaum Associates, Inc., Mahwah, NJ.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Mark D.</given-names>
            <surname>Shermis</surname>
          </string-name>
          and Jill C. Burstein (Eds.).
          <year>2013</year>
          .
          <article-title>Handbook of automated essay evaluation: Current applications and new directions</article-title>
          . Routledge, New York.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Dannelle D.</given-names>
            <surname>Stevens</surname>
          </string-name>
          and
          <string-name>
            <given-names>Antonia J.</given-names>
            <surname>Levi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Introduction to Rubrics: An Assessment Tool to Save Grading Time, Convey Effective Feedback, and Promote Student Learning</article-title>
          .
          <source>Stylus Publishing</source>
          , LLC.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Kaveh</given-names>
            <surname>Taghipour</surname>
          </string-name>
          and Hwee Tou Ng.
          <year>2016</year>
          .
          <article-title>A neural approach to automated essay scoring</article-title>
          .
          <source>In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Tomanek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Udo</given-names>
            <surname>Hahn</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Reducing Class Imbalance During Active Learning for Named Entity Annotation</article-title>
          .
          <source>In Proceedings of the Fifth International Conference on Knowledge Capture (K-CAP '09)</source>
          . ACM, New York, NY, USA,
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          . https://doi.org/10.1145/1597735.1597754
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Edward W.</given-names>
            <surname>Wolfe</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Identifying rater effects using latent trait models</article-title>
          .
          <source>Psychology Science</source>
          <volume>46</volume>
          (
          <year>2004</year>
          ),
          <fpage>35</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>