Toward Better Training in Peer Assessment: Does Calibration Help?

Yang Song, Zhewei Hu, Edward F. Gehringer
Department of Computer Science
North Carolina State University
Raleigh, NC, U.S.
{ysong8, zhu6, efg}@ncsu.edu

Julia Morris, Jennifer Kidd
Darden College of Education
Old Dominion University
Norfolk, VA, U.S.
{jmorr005, jkidd}@odu.edu

Stacie Ringleb
Department of Mechanical & Aerospace Engineering
Old Dominion University
Norfolk, VA, U.S.
sringleb@odu.edu

The Peerlogic project is funded by the National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, and 1431975.

ABSTRACT
For peer assessments to be helpful, student reviewers need to submit reviews of good quality. This requires some training or guidance from the teaching staff; otherwise, reviewers may read each other's work uncritically, assigning good scores but offering few suggestions. One approach to improving review quality is calibration. Calibration refers to comparing students' individual reviews to a standard, usually a review done by the teaching staff on the same artifact. In this paper, we categorize two modes of calibration for peer assessment and discuss our experience with both of them in a pilot study with the Expertiza system.

Keywords
Educational peer review; peer assessment; calibration.

1. INTRODUCTION
Writing assignments are used across the curriculum because they hone communication skills and teach critical thinking. Unfortunately, they impose a considerable grading burden, since it is time-consuming to give good feedback on writing. Many instructors may turn to computer-supported peer-review systems for help; indeed, reviewing writing was the motivation behind long-lived peer-assessment systems like the Daedalus Integrated Writing Environment and Calibrated Peer Review™.

In educational peer-review systems, students submit their artifacts, and other students rate and/or comment on the artifacts submitted by their peers. Previous research has shown that this process benefits both reviewers and reviewees. The reviewers benefit by seeing others' work and thinking metacognitively about how they can improve their own. The reviewees profit from receiving comments and advice from their classmates; that feedback is both more timely and more copious than feedback from the teaching staff [1].

The efficacy of peer assessment depends heavily on the quality of the reviewing. Left to their own devices, students tend to examine peers' work uncritically and make few suggestions on how to improve it. When asked to rate it on a Likert scale, they gravitate to the upper end of the scale, making little distinction between the various artifacts that they review [2].

One approach to improving the quality of peer review is to interpose a calibration phase before the actual peer-review task. "Calibration" refers to having students evaluate sample artifacts that have already been rated by the teaching staff. The online peer-review system can then use the comparison between students' reviews and those of the teaching staff to calculate review-proficiency values for the students. This approach was pioneered in Calibrated Peer Review™ [3], [4] and later adopted by other systems as well (such as Coursera [5], EduPCR5.8 [6], Expertiza [7], Mechanical TA [8], Peerceptiv [9], and Peergrade.io).
2. TWO MODES OF CALIBRATION
We can divide calibration into two modes. The first mode separates the calibration from the actual peer-review assignments, in which students rate and comment on each other's work. We call this stand-alone calibration. An example is Calibrated Peer Review™. A calibrated assignment has a separate calibration phase in which students rate three sample artifacts, one of which is exemplary and the other two of which have known defects. The system uses their ratings to calculate the Reviewer Competency Index, a measure of the student's review proficiency [3], [4]. The motivation for this mode of calibration is to train students to become proficient reviewers before they start to review each other's artifacts. The resultant peer-review grades should have greater validity and thereby make grading easier for the teaching staff.

The other mode of calibration combines the calibration with ordinary peer-review activity. In the peer-review phase, students review both sample artifacts and artifacts submitted by their peers. Usually, they are not aware of whether an artifact is a calibration sample or an actual peer submission. We call this approach mixed calibration. An example is the Coursera system [5]. In a calibrated assignment, the teaching staff grades only a small number of artifacts, which are then used as sample artifacts in the peer-review phase. When doing peer review, each student evaluates four random artifacts and one sample artifact that has already been graded by the teaching staff. Just as in stand-alone calibration, review proficiency is determined by the agreement between students and teaching staff on the sample artifacts.

Comparing these two modes of calibration, we observe that stand-alone calibration requires more work from the teaching staff: they need to locate sample artifacts (which they could take from earlier semesters) and set up a calibration phase in the assignment. Students are aware that they are rating sample artifacts, so they may pay more attention than they do in the actual peer-review tasks, which also makes it harder to test the efficacy of the calibration. However, stand-alone calibration fits in well with in-class lecture. Instructors can give students time to do the calibration in class as training. They can also explain how the sample artifacts were rated, so that students gain a better understanding of the rating rubrics.

Mixed calibration emphasizes not training (making students better peer reviewers) but score aggregation: identifying the good reviewers and using their peer-review responses to aggregate grades for each artifact. Therefore, students who do poorly on the peer review do not receive any pedagogical intervention, even though their identities are known. Consequently, mixed calibration is used more often in classes of massive size, e.g., some courses in the Coursera system.
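Both modes ultimately rest on the same computation: compare a student's ratings on the sample artifacts with the staff's ratings and turn the level of agreement into a proficiency score. The sketch below is a minimal illustration of that idea; the function name, the one-point "adjacent" tolerance, and the half-credit weighting are our own assumptions, not the formula used by Calibrated Peer Review™, Coursera, or Expertiza.

"""Minimal sketch of an agreement-based review-proficiency score.
The names, the +/-1 tolerance, and the half-credit weighting are
illustrative assumptions, not any cited system's actual formula."""

def review_proficiency(student_ratings, expert_ratings, tolerance=1):
    """Score a student's calibration reviews against the expert reviews.

    Both arguments map sample-artifact IDs to lists of per-criterion
    scores. Exact matches earn full credit; scores within `tolerance`
    earn half credit; everything else earns nothing.
    """
    earned, possible = 0.0, 0
    for artifact_id, expert_scores in expert_ratings.items():
        student_scores = student_ratings.get(artifact_id, [])
        for student, expert in zip(student_scores, expert_scores):
            possible += 1
            if student == expert:
                earned += 1.0          # exact agreement
            elif abs(student - expert) <= tolerance:
                earned += 0.5          # adjacent agreement
    return earned / possible if possible else 0.0

if __name__ == "__main__":
    expert = {"sample-1": [5, 4, 3], "sample-2": [2, 1, 2]}
    student = {"sample-1": [5, 3, 3], "sample-2": [4, 1, 2]}
    print(f"proficiency = {review_proficiency(student, expert):.2f}")  # 0.75

A system could then weight each student's peer-review scores by such a proficiency value when aggregating grades, which is the role calibration plays in the mixed mode.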
2.1 Calibration in Expertiza
Beginning in 2016, the Expertiza system has included a calibration feature, which supports both stand-alone calibration and mixed calibration. In setting up an assignment, an instructor can designate it as a calibrated assignment and submit sample artifacts along with "expert" reviews. The instructor can give students the right to do reviews but not to submit work, which makes the assignment a stand-alone calibration assignment. (Ordinarily, students are permitted both to submit and to review.)

Reviewing in Expertiza is done double-blind. In neither calibration mode did student reviewers see the expert review before they finished reviewing an artifact. However, after a student finishes reviewing an artifact that is a calibration sample already reviewed by the instructor, Expertiza shows a comparison between the student's review and the expert review (see Figure 1 for an example). No update is allowed after the expert review is displayed.

Figure 1. Comparison page between a student's review and the expert review
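The reveal-then-lock behavior just described can be summarized as a small state machine: a calibration review is editable until it is submitted, at which point the expert comparison becomes visible and further edits are rejected. The sketch below is our own simplified illustration of that workflow, not Expertiza's implementation; the class and method names are hypothetical.

"""Simplified sketch of the calibration-review workflow in Section 2.1.
Not Expertiza code; the class and its methods are hypothetical names
used only to illustrate the reveal-then-lock behavior."""

class CalibrationReview:
    def __init__(self, expert_scores):
        self.expert_scores = expert_scores   # instructor's ratings for the sample
        self.student_scores = None
        self.submitted = False

    def save(self, scores):
        """Record (or revise) the student's ratings before submission."""
        if self.submitted:
            raise PermissionError("Review is locked once the expert review is shown.")
        self.student_scores = scores

    def submit(self):
        """Finalize the review and reveal the expert comparison."""
        self.submitted = True
        return self.comparison()

    def comparison(self):
        """Pair student and expert scores, criterion by criterion."""
        if not self.submitted:
            raise PermissionError("Expert review is hidden until submission.")
        return list(zip(self.student_scores, self.expert_scores))

review = CalibrationReview(expert_scores=[5, 4, 2])
review.save([4, 4, 3])
print(review.submit())        # [(4, 5), (4, 4), (3, 2)]
# review.save([5, 4, 2])      # would raise PermissionError: review is locked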
3. ASSIGNMENT DESIGN
Three instructors at two universities set up a total of four calibration assignments using Expertiza. These assignments used the calibration feature in Spring 2016 but did not use calibration in Fall 2015. Apart from the calibration, the four assignments kept the same settings, including the review rubrics.

• Assignment 1: Course: Foundations and Introduction to Assessment of Education; Assignment: Grade Sample Lessons. This assignment was a precursor intended to engage students in evaluating peers' writing before they assessed each other's work. Pre-service teachers were asked to grade two different example lesson plans with a five-item rubric, rating the (1) importance, (2) interest, (3) credibility, (4) effectiveness, and (5) writing quality of the lesson. They were asked to consider what was effective and ineffective in each lesson based on the strengths and weaknesses they identified from the rubric. The artifacts were lessons created by students in prior semesters whose work exemplified both noteworthy achievements and pitfalls. By evaluating these two lessons, students gained valuable insight into the act of evaluating peers' writing and were provided with a model to guide their own submissions. The students completed the calibration assignment, rating each of the rubric categories on a 1-to-5 scale. Their results were then compared with the "expert" review completed by the course instructor.

• Assignment 2: Course: Project Design and Management I; Assignment: Practice Introduction to Peer Review. This assignment was designed to expose students to writing an introduction for their senior project, to orient them to the peer-review process, and to convey the instructor's expectations for the peer-review assignment. The calibration exercise had the students peer-review two introductions from a previous class, one that received a good grade and one that received a poor grade. The calibration exercise was performed before the introduction was drafted. The general introduction assignment included a draft with an in-class peer review, a second-draft peer review using Expertiza, and the submission of a final draft.

• Assignment 3: Course: Object-Oriented Design and Development; Assignment: Calibration for reviewing Wikipedia pages. This assignment was intended to get the students ready to write and peer-review Wikipedia entries. The instructor provided a list of topics on recent software-development techniques, frameworks, and products. Some of these topics had pre-existing Wikipedia pages; some did not. Where the pages existed, they were stubs or otherwise in need of improvement. Students could choose one topic and create the corresponding page. Students were then required to review at least two others' artifacts and provide both textual feedback and ratings. We created a separate assignment for the calibration. The sample artifacts were chosen from a previous semester. The instructor took two reviews done by good reviewers and made further changes to them in an effort to make the reviews exemplary.

• Assignment 4: Course: Object-Oriented Design and Development; Assignment: Create and review CRC (class-responsibility-collaborator) cards. CRC cards are an approach to designing object-oriented software. The instructor's students tended to make the same mistakes, semester after semester. The goals of this calibration assignment were to (1) allow students to submit their own CRC-card designs and (2) have them review CRC-card designs that contained common mistakes. In this assignment, each student reviewed one of their peers' designs and two designs arranged by the instructor to contain common mistakes. These designs were created by merging the errors made by previous students on an exam. Unlike the other three calibration assignments, this assignment did not precede another assignment in which the students submitted their own work. Rather, it was done as practice for the next exam.

We asked the instructors to identify a few good reviewers whose reviews in the actual peer-review assignments were of exemplary quality, so that we could compare student performance on the calibration assignment with performance on the actual assignments for which the calibration served as training. To evaluate student performance on the different assignments, we used the metrics below.

• Percentage of exact agreement on each criterion. All the rubrics used in our experiments were scored on either a 0-to-5 or a 1-to-5 scale. On each criterion, exact agreement means that the instructor and the student gave exactly the same score.

• Percentage of adjacent agreement on each criterion. On each criterion, adjacent agreement means that the score assigned by the student is within ±1 of the instructor's score.

• Percentage of empty comment boxes. Some criteria asked students to give both a score and textual feedback. In the calibration, the instructors tried to give textual feedback on all of these criteria. If the sample artifact was in good shape, the instructors commented on why it was good; if the sample artifact needed improvement, the instructors suggested changes for the author to consider. We hoped this would encourage students to comment on more of the criteria.

• Average non-empty comment length. We counted the words in the non-empty responses. In calibration, the expert reviews were usually longer than the average student review (see Figure 1 for an example).

• Average number of constructive comments. We tried to measure how much constructive content was provided in the non-empty responses. We used the same constructive lexicon used by Hsiao and Naveed [10], [11]. This lexicon focuses mainly on assessment, emphasis, causation, generalization, and conditional sentence patterns.

• Readability. We used the Flesch-Kincaid readability index [12], which considers the length of sentences and the length of words. The index rates text between 0 (difficult to read) and 100 (easy to read). Conversational English usually scores between 80 and 90 on this index. Text is considered hard to read (usually requiring a college education or higher) if the index is lower than 50.
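To make these definitions concrete, the sketch below computes the metrics for one small set of responses. It is a minimal illustration under our own assumptions: the tiny constructive lexicon is a stand-in for the one from Hsiao and Naveed [10], [11], the syllable counter is crude, and the readability function uses the standard 0-to-100 Flesch Reading Ease formula rather than the exact tool used in the study.

"""Minimal sketch of the evaluation metrics listed above.
The lexicon and syllable counter are simplified stand-ins, not the
instruments actually used in the study."""

import re

# Toy stand-in for the constructive lexicon of Hsiao and Naveed [10], [11].
CONSTRUCTIVE_PATTERNS = [r"\bshould\b", r"\bbecause\b", r"\bhowever\b",
                         r"\bif\b", r"\bfor example\b", r"\bconsider\b"]

def exact_agreement(student, expert):
    """Fraction of criteria where the student's score equals the expert's."""
    return sum(s == e for s, e in zip(student, expert)) / len(expert)

def adjacent_agreement(student, expert, tolerance=1):
    """Fraction of criteria where the student is within ±tolerance of the expert."""
    return sum(abs(s - e) <= tolerance for s, e in zip(student, expert)) / len(expert)

def empty_comment_pct(comments):
    """Percentage of comment boxes left empty (whitespace-only counts as empty)."""
    return 100.0 * sum(not c.strip() for c in comments) / len(comments)

def avg_nonempty_length(comments):
    """Average word count over the non-empty comments."""
    lengths = [len(c.split()) for c in comments if c.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0

def constructive_count(comment):
    """Number of constructive-lexicon patterns matched in one comment."""
    return sum(bool(re.search(p, comment.lower())) for p in CONSTRUCTIVE_PATTERNS)

def _syllables(word):
    """Very rough syllable estimate: runs of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease (0 = hard, 100 = easy), per the formula in [12]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    words = re.findall(r"[A-Za-z']+", text) or ["x"]
    syllables = sum(_syllables(w) for w in words)
    return (206.835 - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

if __name__ == "__main__":
    expert_scores, student_scores = [5, 4, 2, 3], [5, 3, 4, 3]
    comments = ["", "Consider adding an example because the claim is abstract.",
                "", "Good flow."]
    print(exact_agreement(student_scores, expert_scores))      # 0.5
    print(adjacent_agreement(student_scores, expert_scores))   # 0.75
    print(empty_comment_pct(comments))                         # 50.0
    print(avg_nonempty_length(comments))                       # 5.5
    print(constructive_count(comments[1]))                     # 2
    print(round(flesch_reading_ease(comments[1]), 1))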
4. HOW CALIBRATION AFFECTS STUDENT PERFORMANCE
4.1 Results for stand-alone calibration
The first three calibration assignments (Assignments 1, 2, and 3) were each followed by an actual assignment in which the students carried out the same kind of review on which they had been calibrated. We measured the percentage of empty comments, the average comment length, the number of constructive comments in the response to each criterion, and the overall readability. For the following assignment, we also measured the students' agreement with exemplary reviews (done by the students identified as good reviewers). The results are shown in Table 1.

Table 1. Metrics for the calibration assignments, the assignments following the calibration assignments, and the corresponding actual assignments in the previous semester (comment length in words)

Group                           Assignment   Exact       Adjacent    Empty       Avg. non-empty    Avg. constructive   Readability
                                             agreement   agreement   comments    comment length    comments
Calibration assignment          Assgt. 1     53.20%      83.80%      31.80%      17.4              0.35                58.9
                                Assgt. 2     21.60%      32.10%      17.40%      22.1              0.31                49.8
                                Assgt. 3     45.90%      85.80%      11.20%      18.0              0.27                54.4
Assignment right after the      Assgt. 1     48.00%      86.70%      26.80%      21.8              0.44                63.2
calibration assignment          Assgt. 2     26.70%      61.70%      13.20%      21.2              0.35                50.8
                                Assgt. 3     49.10%      92.00%       8.50%      14.4              0.25                55.9
Corresponding actual            Assgt. 1     N/A         N/A         20.80%      18.3              0.36                62.6
assignment from the             Assgt. 2     N/A         N/A         15.10%      28.0              0.48                51.5
previous semester               Assgt. 3     N/A         N/A         46.10%       8.6              0.14                57.2

In all three classes, we found a similar amount of exact agreement on the calibration assignment and the following assignment, but we observed increases in adjacent agreement on the following assignment. The reason could be that the calibration phase led students to become more skilled and more polite as reviewers. The instructor of Assignment 1 observed that her students were critical, or even bullying, in their peer reviews at the very beginning of the semester. In the calibration phase, students were able to see how the instructor reacted to various issues and what grades the instructor gave. This gave students guidance on how to rate artifacts that still needed improvement.

We also noted that the percentage of empty comments dropped between the calibration assignment and the assignment right after it, indicating that students were more willing to give comments after the calibration. Relative to the previous semester, two of the three classes had a lower empty-comment percentage on the corresponding assignments.

The comment lengths on the calibration assignment and the following assignment were almost the same. Two out of three classes had a higher average comment length after they did calibration, compared with the corresponding assignments in the previous semester.

Judging from the amount of constructive content per response to each criterion, students tended to give as many or more constructive comments in the peer review after the calibration. Two out of three classes made more constructive comments after calibration compared with the corresponding assignments in the previous semester.

We also found that students tended to write more complicated sentences in the calibration tasks; in the assignments right after the calibration, their comments were a little easier to read but still close to college level, which was acceptable to the instructors.
4.2 Results for mixed calibration
Assignment 4 was our only experiment with the mixed-calibration mode: each student reviewed two calibration submissions and one submission from a classmate. Unlike Assignments 1-3, which aimed to train students to become better reviewers for the actual peer assessment, Assignment 4 was not followed by an "actual" assignment on the same topic. Instead, Assignment 4 was designed to give students the opportunity to see common mistakes that others had made on a certain kind of exam question (on CRC-card design) in earlier semesters.

On Assignment 4, the percentage of exact agreement was 52.2% and the percentage of adjacent agreement was 91.3%, both very high. This was partially due to a review rubric that asked students to count the number of errors of certain types (e.g., the number of class names that are not singular nouns), instead of ordinary rubric criteria that ask students to rate the artifact on some aspect (e.g., the language usage of an article). This rubric design reduced ambiguity and thereby increased agreement. The percentage of empty comments was 77.0%, the average non-empty comment length was 5.4, and the average number of constructive comments was 0.13, all lower than on Assignments 1-3. The ostensible reason is that the review rubric was not designed to encourage students to give textual comments, but simply to count errors. The review readability index was 60.1, which indicates that, for those reviewers who did give textual feedback, the feedback was not as short and simple as we expected.

We hypothesized that after this calibration, students' average score on related exam questions would be higher. We compared student performance on CRC-card-related questions on the exams of this semester (with calibration as training) and the previous semester (without training). However, the students' average grade on those questions was 85.3% this semester and 85.4% the previous semester; we did not find any significant change. Upon seeing these results, we surmised that the calibration assignment was done several weeks before the next exam, and, without follow-up practice, students forgot the training they had received.

5. WHAT SAMPLE ARTIFACTS SHOULD WE USE FOR CALIBRATION?
After students finish the calibration, the instructor can see the calibration report for each artifact, as shown in Figure 2. Each table shows the students' grades on each question for a sample artifact. The green color highlights the expert grade, and the bolded number is the plurality of the students' grades.

Figure 2. A calibration report in the Expertiza system

Figure 2 shows a sample artifact on which the calibration was quite successful, with exact agreement of more than 40% and adjacent agreement of almost 80%. However, it was not immediately clear whether this was related to the quality of the artifact. When we calculated the percentages of agreement for each sample artifact, we found that the level of agreement is related to the quality of the artifact: the higher the grade a sample received, the higher the agreement students tended to achieve. This raises another question: what kinds of artifacts work better as samples in calibration?

We put the percentages of agreement and the grades for the artifacts together to examine the relationship between the agreement and the grades that the sample artifacts received. We used both the sample artifacts and the artifacts reviewed by the exemplary reviewers. The distributions and fit lines are shown in Figures 3 and 4.

Figure 3. Relationship between adjacent-agreement percentage and sample grade

Figure 4. Relationship between exact-agreement percentage and sample grade
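The per-sample analysis behind Figures 3 and 4 amounts to computing an agreement percentage for each sample artifact and fitting a line against the sample's grade. The sketch below shows one way to do that with an ordinary least-squares fit; the function names are ours, and the data values are made up for illustration rather than taken from the study.

"""Sketch of the per-sample analysis behind Figures 3 and 4: compute an
agreement percentage per sample artifact, then fit a least-squares line
against the sample grades. The numbers below are illustrative only."""

def adjacent_agreement_pct(student_score_sets, expert_scores, tolerance=1):
    """Percent of all (reviewer, criterion) ratings within ±tolerance of the expert."""
    hits = total = 0
    for scores in student_score_sets:          # one list of scores per reviewer
        for s, e in zip(scores, expert_scores):
            total += 1
            hits += abs(s - e) <= tolerance
    return 100.0 * hits / total

def least_squares_fit(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

if __name__ == "__main__":
    # Hypothetical samples: sample grade vs. per-sample adjacent agreement %.
    grades = [95, 88, 80, 72, 60]
    agreement = [92.0, 85.0, 78.0, 70.0, 55.0]
    a, b = least_squares_fit(grades, agreement)
    print(f"fit line: agreement = {a:.2f} * grade + {b:.2f}")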
We find that the samples that received higher grades usually elicited higher levels of agreement (both exact and adjacent). The lower the quality of a sample, the lower the agreement we observed between teaching staff and students.

We looked into the samples used in each assignment and found that it is usually harder for students to make the same judgment as the teaching staff on an artifact of low quality. There could be multiple reasons. The first is that the teaching staff have seen more artifacts; they know the distribution of artifact quality and can therefore make better judgments. Student reviewers may be able to tell that an artifact is of low quality based on one criterion, but they can be more critical than warranted because they have not seen even worse examples. From this perspective, it is important for instructors to include at least one or two low-quality artifacts among the samples, to show students how to rate poor work.

Another factor that may lower the agreement between teaching staff and students is the reliability of the criterion: some criteria are not specific enough for reviewers to make reliable judgments [2]. For example, the criterion "(On a Likert scale) does the author provide enough examples in this article?" is not reliable, since "enough" is not well defined. To improve review rubrics, instructors can create "advice" for each level (sometimes known as an "anchored scale"), for example, "1/5: no example provided." From this perspective, calibration can also be used to test the instructor's review rubric.

6. CONCLUSION
In this paper, we have described our experience with calibration in peer assessment in Expertiza. We first introduced two modes of calibration that have been used in online peer-assessment systems: stand-alone calibration and mixed calibration. Stand-alone calibration trains students to become better reviewers, while mixed calibration finds credible reviewers in the course of performing peer assessment. We also discussed the pedagogical scenarios for which each mode is suitable.

We calculated the agreement between the students' ratings and the teaching staff's ratings on the sample artifacts. We found that students in our assignments, on average, agreed exactly with the teaching staff on more than 40% of the ratings; that is, on more than 40% of the ratings done by students during calibration, the student gave exactly the same score as the teaching staff. In addition, more than 70% of the ratings done by students were within ±1 of the scores given by the teaching staff. To test whether students still performed as well on the actual peer assessment after training, we asked the teaching staff to identify some good reviewers in each course. Using their reviews as exemplars, we found that, in the actual peer-assessment phases, the agreement was similar to that on the calibration assignments, and sometimes even a little higher.

We compared the volume of textual feedback in the semester with calibration and the previous semester without calibration. We found that after calibration, students tended to give more extensive textual feedback, fill in more text boxes with comments, and give more constructive feedback.

We also found that the level of rating agreement between students and teaching staff is related to the quality of the artifact; namely, students tended to agree less with the teaching staff on artifacts of low quality. To improve agreement, we suggest that (1) in the calibration, an instructor use both median-quality and low-quality artifacts as samples, and (2) the instructor provide "advice" for each level of each criterion.

One future study we are interested in is calibrating the textual feedback; in this paper, we have calibrated only the numerical scores. It is possible that a student and the teaching staff both gave a 4/5 on one criterion of a sample artifact, yet did not see the same issue. This kind of agreement can only be measured by calibration of the textual feedback.
7. REFERENCES
[1] E. F. Gehringer, "A Survey of Methods for Improving Review Quality," in New Horizons in Web Based Learning, Y. Cao, T. Väljataga, J. K. T. Tang, H. Leung, and M. Laanpere, Eds. Springer International Publishing, 2014, pp. 92-97.
[2] Y. Song, Z. Hu, and E. F. Gehringer, "Closing the Circle: Use of Students' Responses for Peer-Assessment Rubric Improvement," in Advances in Web-Based Learning -- ICWL 2015, F. W. B. Li, R. Klamma, M. Laanpere, J. Zhang, B. F. Manjón, and R. W. H. Lau, Eds. Springer International Publishing, 2015, pp. 27-36.
[3] R. Robinson, "Calibrated Peer Review™," Am. Biol. Teach., vol. 63, no. 7, pp. 474-480, Sep. 2001.
[4] A. Russell, "Calibrated peer review: a writing and critical-thinking instructional tool," in Teaching Tips: Innovations in Undergraduate Science Instruction, 2004, p. 54.
[5] C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller, "Tuned Models of Peer Assessment in MOOCs," arXiv:1307.2579 [cs, stat], Jul. 2013.
[6] Y. Wang, Y. Jiang, M. Chen, and X. Hao, "E-learning-oriented incentive strategy: Taking EduPCR system as an example," World Trans. Eng. Technol. Educ., vol. 11, no. 3, pp. 174-179, Nov. 2013.
[7] E. Gehringer, "Expertiza: information management for collaborative learning," in Monit. Assess. Online Collab. Environ. Emergent Comput. Technol. E-Learn. Support, 2009, pp. 143-159.
[8] J. R. Wright, C. Thornton, and K. Leyton-Brown, "Mechanical TA: Partially Automated High-Stakes Peer Grading," in Proceedings of the 46th ACM Technical Symposium on Computer Science Education, New York, NY, USA, 2015, pp. 96-101.
[9] C. Schunn, A. Godley, and S. DeMartino, "The Reliability and Validity of Peer Review of Writing in High School AP English Classes," J. Adolesc. Adult Lit., Apr. 2016.
[10] I. H. Hsiao and F. Naveed, "Identifying learning-inductive content in programming discussion forums," in IEEE Frontiers in Education Conference (FIE), 2015, pp. 1-8.
[11] Y. Song, Z. Hu, Y. Guo, and E. Gehringer, "An Experiment with Separate Formative and Summative Rubrics in Educational Peer Assessment," submitted to IEEE Frontiers in Education Conference (FIE), 2016.
[12] J. P. Kincaid et al., "Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel," Feb. 1975.