Toward Better Training in Peer Assessment: Does Calibration Help?

Yang Song, Zhewei Hu, Edward F. Gehringer
Department of Computer Science
North Carolina State University
Raleigh, NC, U.S.
{ysong8, zhu6, efg}@ncsu.edu

Julia Morris, Jennifer Kidd
Darden College of Education
Old Dominion University
Norfolk, VA, U.S.
{jmorr005, jkidd}@odu.edu

Stacie Ringleb
Department of Mechanical & Aerospace Engineering
Old Dominion University
Norfolk, VA, U.S.
sringleb@odu.edu

The Peerlogic project is funded by the National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, and 1431975.

ABSTRACT
For peer assessments to be helpful, student reviewers need to submit reviews of good quality. This requires some training or guidance from the teaching staff; otherwise, reviewers may read each other's work uncritically, assigning good scores but offering few suggestions. One approach to improving review quality is calibration. Calibration refers to comparing students' individual reviews to a standard, usually a review done by the teaching staff on the same artifact. In this paper, we categorize two modes of calibration for peer assessment and discuss our experience with both of them in a pilot study with the Expertiza system.

Keywords
Educational peer review; peer assessment; calibration.

1. INTRODUCTION
Writing assignments are used across the curriculum because they hone communication skills and teach critical thinking. Unfortunately, they impose a considerable grading burden, since it is time-consuming to give good feedback on writing. Many instructors may turn to computer-supported peer-review systems for help; indeed, reviewing writing was the motivation behind long-lived peer-assessment systems like the Daedalus Integrated Writing Environment and Calibrated Peer Review™.

In educational peer-review systems, students submit their artifacts, and other students rate and/or comment on the artifacts submitted by their peers. Previous research has shown that this process benefits both reviewers and reviewees. The reviewers benefit by seeing others' work and thinking metacognitively about how they can improve their own. The reviewees profit from receiving comments and advice from their classmates; that feedback is both more timely and more copious than feedback from the teaching staff [1].

The efficacy of peer assessment depends heavily on the quality of the reviewing. Left to their own devices, students tend to examine peers' work uncritically and make few suggestions on how to improve it. When asked to rate it on a Likert scale, they gravitate to the upper end of the scale, making little distinction between the various artifacts that they review [2].

One approach to improving the quality of peer review is to interpose a calibration phase before the actual peer-review task. "Calibration" refers to having students evaluate sample artifacts that have already been rated by the teaching staff. The online peer-review system can then use the comparison between students' reviews and those of the teaching staff to calculate review-proficiency values for the students. This approach was pioneered in Calibrated Peer Review™ [3], [4] and later adopted by other systems as well (such as Coursera [5], EduPCR5.8 [6], Expertiza [7], Mechanical TA [8], Peerceptiv [9], and Peergrade.io).
2. TWO MODES OF CALIBRATION
We can divide calibration into two modes. The first mode separates the calibration from the actual peer-review assignments, in which students rate and comment on each other's work. We call this stand-alone calibration. An example is Calibrated Peer Review™. A calibrated assignment has a separate calibration phase in which students rate three sample artifacts, one of which is exemplary and the other two of which have known defects. The system uses their ratings to calculate the Reviewer Competency Index, a measure of the student's review proficiency [3], [4]. The motivation for this mode of calibration is to train students to become proficient reviewers before they start to review each other's artifacts. The resultant peer-review grades should have greater validity and thereby make grading easier for the teaching staff.

The other mode of calibration combines the calibration with ordinary peer-review activity. In the peer-review phase, students review both sample artifacts and artifacts submitted by their peers. Usually, they are not aware of whether an artifact is a calibration sample or an actual peer submission. We call this approach mixed calibration. An example is the Coursera system [5]. In a calibrated assignment, the teaching staff grades only a small number of artifacts, which are then used as sample artifacts in the peer-review phase. When doing peer review, each student evaluates four random artifacts and one sample artifact that has already been graded by the teaching staff. Just as in stand-alone calibration, review proficiency is determined by the agreement between students and teaching staff on the sample artifacts.

Comparing these two modes of calibration, we observe that stand-alone calibration requires more work from the teaching staff: they need to locate sample artifacts (which they could take from earlier semesters) and set up a calibration phase in the assignment. Students are aware that they are rating sample artifacts, so they may pay more attention than they do in the actual peer-review tasks, which also makes it harder to test the efficacy of the calibration. However, stand-alone calibration fits in well with in-class lecture. Instructors can give students time to do the calibration in class as training. They can also explain how the sample artifacts were rated, so that students gain a better understanding of the rating rubrics.

Mixed calibration emphasizes not training (making students better peer reviewers) but score aggregation: identifying the good reviewers and using their peer-review responses to aggregate grades for each artifact. Therefore, students who do poorly on the peer review do not receive any pedagogical intervention, even though their identities are known. Consequently, mixed calibration is used more often in classes of massive size, e.g., some courses in the Coursera system.
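Both modes ultimately rest on the same computation: compare a student's ratings on the sample artifacts with the staff's ratings and turn the level of agreement into a proficiency score. The sketch below is a minimal illustration of that idea; the function name, the one-point "adjacent" tolerance, and the half-credit weighting are our own assumptions, not the formula used by Calibrated Peer Review™, Coursera, or Expertiza.

"""Minimal sketch of an agreement-based review-proficiency score.
The names, the +/-1 tolerance, and the half-credit weighting are
illustrative assumptions, not any cited system's actual formula."""

def review_proficiency(student_ratings, expert_ratings, tolerance=1):
    """Score a student's calibration reviews against the expert reviews.

    Both arguments map sample-artifact IDs to lists of per-criterion
    scores. Exact matches earn full credit; scores within `tolerance`
    earn half credit; everything else earns nothing.
    """
    earned, possible = 0.0, 0
    for artifact_id, expert_scores in expert_ratings.items():
        student_scores = student_ratings.get(artifact_id, [])
        for student, expert in zip(student_scores, expert_scores):
            possible += 1
            if student == expert:
                earned += 1.0          # exact agreement
            elif abs(student - expert) <= tolerance:
                earned += 0.5          # adjacent agreement
    return earned / possible if possible else 0.0

if __name__ == "__main__":
    expert = {"sample-1": [5, 4, 3], "sample-2": [2, 1, 2]}
    student = {"sample-1": [5, 3, 3], "sample-2": [4, 1, 2]}
    print(f"proficiency = {review_proficiency(student, expert):.2f}")  # 0.75

A system could then weight each student's peer-review scores by such a proficiency value when aggregating grades, which is the role calibration plays in the mixed mode.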
2.1 Calibration in Expertiza
Beginning in 2016, the Expertiza system has included a calibration feature, which supports both stand-alone calibration and mixed calibration. In setting up an assignment, an instructor can designate it as a calibrated assignment and submit sample artifacts along with "expert" reviews. The instructor can give students the right to do reviews but not to submit work, which makes the assignment a stand-alone calibration assignment. (Ordinarily, students are permitted both to submit and to review.)

Reviewing in Expertiza is done double-blind. In neither calibration mode did student reviewers see the expert review before they finished reviewing an artifact. However, after a student finishes reviewing an artifact that is a calibration sample already reviewed by the instructor, Expertiza shows a comparison between the student's review and the expert review (see Figure 1 for an example). No update is allowed after the expert review is displayed.

Figure 1. Comparison page between a student's review and the expert review
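The reveal-then-lock behavior just described can be summarized as a small state machine: a calibration review is editable until it is submitted, at which point the expert comparison becomes visible and further edits are rejected. The sketch below is our own simplified illustration of that workflow, not Expertiza's implementation; the class and method names are hypothetical.

"""Simplified sketch of the calibration-review workflow in Section 2.1.
Not Expertiza code; the class and its methods are hypothetical names
used only to illustrate the reveal-then-lock behavior."""

class CalibrationReview:
    def __init__(self, expert_scores):
        self.expert_scores = expert_scores   # instructor's ratings for the sample
        self.student_scores = None
        self.submitted = False

    def save(self, scores):
        """Record (or revise) the student's ratings before submission."""
        if self.submitted:
            raise PermissionError("Review is locked once the expert review is shown.")
        self.student_scores = scores

    def submit(self):
        """Finalize the review and reveal the expert comparison."""
        self.submitted = True
        return self.comparison()

    def comparison(self):
        """Pair student and expert scores, criterion by criterion."""
        if not self.submitted:
            raise PermissionError("Expert review is hidden until submission.")
        return list(zip(self.student_scores, self.expert_scores))

review = CalibrationReview(expert_scores=[5, 4, 2])
review.save([4, 4, 3])
print(review.submit())        # [(4, 5), (4, 4), (3, 2)]
# review.save([5, 4, 2])      # would raise PermissionError: review is locked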
3. ASSIGNMENT DESIGN
Three instructors at two universities set up a total of four calibration assignments using Expertiza. These assignments used the calibration feature in Spring 2016 but did not use calibration in Fall 2015. Apart from the calibration, the four assignments kept the same settings, including the review rubrics.

• Assignment 1: Course: Foundations and Introduction to Assessment of Education; Assignment: Grade Sample Lessons. This assignment was a precursor intended to engage students in evaluating peers' writing before they assessed each other's work. Pre-service teachers were asked to grade two different example lesson plans with a five-item rubric, rating the (1) importance, (2) interest, (3) credibility, (4) effectiveness, and (5) writing quality of the lesson. They were asked to consider what was effective and ineffective in each lesson based on the strengths and weaknesses they identified from the rubric. The artifacts were lessons created by students in prior semesters whose work exemplified both noteworthy achievements and pitfalls. By evaluating these two lessons, students gained valuable insight into the act of evaluating peers' writing and were provided with a model to guide their own submissions. The students completed the calibration assignment, rating each of the rubric categories on a 1-to-5 scale. Their results were then compared with the "expert" review completed by the course instructor.

• Assignment 2: Course: Project Design and Management I; Assignment: Practice Introduction to Peer Review. This assignment was designed to expose students to writing an introduction for their senior project, to orient them to the peer-review process, and to convey the instructor's expectations for the peer-review assignment. The calibration exercise had the students peer-review two introductions from a previous class, one that received a good grade and one that received a poor grade. The calibration exercise was performed before the introduction was drafted. The general introduction assignment included a draft with an in-class peer review, a second-draft peer review using Expertiza, and the submission of a final draft.

• Assignment 3: Course: Object-Oriented Design and Development; Assignment: Calibration for reviewing Wikipedia pages. This assignment was intended to get the students ready to write and peer-review Wikipedia entries. The instructor provided a list of topics on recent software-development techniques, frameworks, and products. Some of these topics had pre-existing Wikipedia pages; some did not. Where the pages existed, they were stubs or otherwise in need of improvement. Students could choose one topic and create the corresponding page. Students were then required to review at least two others' artifacts and provide both textual feedback and ratings. We created a separate assignment for the calibration. The sample artifacts were chosen from a previous semester. The instructor took two reviews done by good reviewers and made further changes to them in an effort to make the reviews exemplary.

• Assignment 4: Course: Object-Oriented Design and Development; Assignment: Create and review CRC (class-responsibility-collaborator) cards. CRC cards are an approach to designing object-oriented software. The instructor's students tended to make the same mistakes, semester after semester. The goals of this calibration assignment were to (1) allow students to submit their own CRC-card designs and (2) have them review CRC-card designs that contained common mistakes. In this assignment, each student reviewed one of their peers' designs and two designs arranged by the instructor to contain common mistakes. These designs were created by merging the errors made by previous students on an exam. Unlike the other three calibration assignments, this assignment did not precede another assignment in which the students submitted their own work. Rather, it was done as practice for the next exam.

We asked the instructors to identify a few good reviewers whose reviews in the actual peer-review assignments were of exemplary quality, so that we could compare student performance on the calibration assignment with performance on the actual assignments for which the calibration served as training. To evaluate student performance on the different assignments, we used the metrics below.

• Percentage of exact agreement on each criterion. All the rubrics used in our experiments were scored on either a 0-to-5 or a 1-to-5 scale. On each criterion, exact agreement means that the instructor and the student gave exactly the same score.

• Percentage of adjacent agreement on each criterion. On each criterion, adjacent agreement means that the score assigned by the student is within ±1 of the instructor's score.

• Percentage of empty comment boxes. Some criteria asked students to give both a score and textual feedback. In the calibration, the instructors tried to give textual feedback on all of these criteria. If the sample artifact was in good shape, the instructors commented on why it was good; if the sample artifact needed improvement, the instructors suggested changes for the author to consider. We hoped this would encourage students to comment on more of the criteria.

• Average non-empty comment length. We counted the words in the non-empty responses. In calibration, the expert reviews were usually longer than the average student review (see Figure 1 for an example).

• Average number of constructive comments. We tried to measure how much constructive content was provided in the non-empty responses. We used the same constructive lexicon used by Hsiao and Naveed [10], [11]. This lexicon focuses mainly on assessment, emphasis, causation, generalization, and conditional sentence patterns.

• Readability. We used the Flesch-Kincaid readability index [12], which considers the length of sentences and the length of words. The index rates text between 0 (difficult to read) and 100 (easy to read). Conversational English usually scores between 80 and 90 on this index. Text is considered hard to read (usually requiring a college education or higher) if the index is lower than 50.
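To make these definitions concrete, the sketch below computes the metrics for one small set of responses. It is a minimal illustration under our own assumptions: the tiny constructive lexicon is a stand-in for the one from Hsiao and Naveed [10], [11], the syllable counter is crude, and the readability function uses the standard 0-to-100 Flesch Reading Ease formula rather than the exact tool used in the study.

"""Minimal sketch of the evaluation metrics listed above.
The lexicon and syllable counter are simplified stand-ins, not the
instruments actually used in the study."""

import re

# Toy stand-in for the constructive lexicon of Hsiao and Naveed [10], [11].
CONSTRUCTIVE_PATTERNS = [r"\bshould\b", r"\bbecause\b", r"\bhowever\b",
                         r"\bif\b", r"\bfor example\b", r"\bconsider\b"]

def exact_agreement(student, expert):
    """Fraction of criteria where the student's score equals the expert's."""
    return sum(s == e for s, e in zip(student, expert)) / len(expert)

def adjacent_agreement(student, expert, tolerance=1):
    """Fraction of criteria where the student is within ±tolerance of the expert."""
    return sum(abs(s - e) <= tolerance for s, e in zip(student, expert)) / len(expert)

def empty_comment_pct(comments):
    """Percentage of comment boxes left empty (whitespace-only counts as empty)."""
    return 100.0 * sum(not c.strip() for c in comments) / len(comments)

def avg_nonempty_length(comments):
    """Average word count over the non-empty comments."""
    lengths = [len(c.split()) for c in comments if c.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0

def constructive_count(comment):
    """Number of constructive-lexicon patterns matched in one comment."""
    return sum(bool(re.search(p, comment.lower())) for p in CONSTRUCTIVE_PATTERNS)

def _syllables(word):
    """Very rough syllable estimate: runs of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease (0 = hard, 100 = easy), per the formula in [12]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
    words = re.findall(r"[A-Za-z']+", text) or ["x"]
    syllables = sum(_syllables(w) for w in words)
    return (206.835 - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))

if __name__ == "__main__":
    expert_scores, student_scores = [5, 4, 2, 3], [5, 3, 4, 3]
    comments = ["", "Consider adding an example because the claim is abstract.",
                "", "Good flow."]
    print(exact_agreement(student_scores, expert_scores))      # 0.5
    print(adjacent_agreement(student_scores, expert_scores))   # 0.75
    print(empty_comment_pct(comments))                         # 50.0
    print(avg_nonempty_length(comments))                       # 5.5
    print(constructive_count(comments[1]))                     # 2
    print(round(flesch_reading_ease(comments[1]), 1))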
4. HOW CALIBRATION AFFECTS STUDENT PERFORMANCE
4.1 Results for stand-alone calibration
The first three calibration assignments (Assignments 1, 2, and 3) were each followed by an actual assignment in which the students carried out the same kind of review on which they had been calibrated. We measured the percentage of empty comments, the average comment length, the number of constructive comments in the response to each criterion, and the overall readability. For the following assignment, we also measured the students' agreement with exemplary reviews (done by the students identified as good reviewers). The results are shown in Table 1.

Table 1. Metrics for the calibration assignments, the assignments following the calibration assignments, and the corresponding actual assignments in the previous semester (comment length in words)

Group                           Assignment   Exact       Adjacent    Empty       Avg. non-empty    Avg. constructive   Readability
                                             agreement   agreement   comments    comment length    comments
Calibration assignment          Assgt. 1     53.20%      83.80%      31.80%      17.4              0.35                58.9
                                Assgt. 2     21.60%      32.10%      17.40%      22.1              0.31                49.8
                                Assgt. 3     45.90%      85.80%      11.20%      18.0              0.27                54.4
Assignment right after the      Assgt. 1     48.00%      86.70%      26.80%      21.8              0.44                63.2
calibration assignment          Assgt. 2     26.70%      61.70%      13.20%      21.2              0.35                50.8
                                Assgt. 3     49.10%      92.00%       8.50%      14.4              0.25                55.9
Corresponding actual            Assgt. 1     N/A         N/A         20.80%      18.3              0.36                62.6
assignment from the             Assgt. 2     N/A         N/A         15.10%      28.0              0.48                51.5
previous semester               Assgt. 3     N/A         N/A         46.10%       8.6              0.14                57.2

In all three classes, we found a similar amount of exact agreement on the calibration assignment and the following assignment, but we observed increases in adjacent agreement on the following assignment. The reason could be that the calibration phase led students to become more skilled and more polite as reviewers. The instructor of Assignment 1 observed that her students were critical, or even bullying, in their peer reviews at the very beginning of the semester. In the calibration phase, students were able to see how the instructor reacted to various issues and what grades the instructor gave. This gave students guidance on how to rate artifacts that still needed improvement.

We also noted that the percentage of empty comments dropped between the calibration assignment and the assignment right after it, indicating that students were more willing to give comments after the calibration. Relative to the previous semester, two of the three classes had a lower empty-comment percentage on the corresponding assignments.

The comment lengths on the calibration assignment and the following assignment were almost the same. Two out of three classes had a higher average comment length after they did calibration, compared with the corresponding assignments in the previous semester.

Judging from the amount of constructive content per response to each criterion, students tended to give as many or more constructive comments in the peer review after the calibration. Two out of three classes made more constructive comments after calibration compared with the corresponding assignments in the previous semester.

We also found that students tended to write more complicated sentences in the calibration tasks; in the assignments right after the calibration, their comments were a little easier to read but still close to college level, which was acceptable to the instructors.
4.2 Results for mixed calibration
Assignment 4 was our only experiment with the mixed-calibration mode: each student reviewed two calibration submissions and one submission from a classmate. Unlike Assignments 1-3, which aimed to train students to become better reviewers for the actual peer assessment, Assignment 4 was not followed by an "actual" assignment on the same topic. Instead, Assignment 4 was designed to give students the opportunity to see common mistakes that others had made on a certain kind of exam question (on CRC-card design) in earlier semesters.

On Assignment 4, the percentage of exact agreement was 52.2% and the percentage of adjacent agreement was 91.3%, both very high. This was partially due to a review rubric that asked students to count the number of errors of certain types (e.g., the number of class names that are not singular nouns), instead of ordinary rubric criteria that ask students to rate the artifact on some aspect (e.g., the language usage of an article). This rubric design reduced ambiguity and thereby increased agreement. The percentage of empty comments was 77.0%, the average non-empty comment length was 5.4, and the average number of constructive comments was 0.13, all lower than on Assignments 1-3. The ostensible reason is that the review rubric was not designed to encourage students to give textual comments, but simply to count errors. The review readability index was 60.1, which indicates that, for those reviewers who did give textual feedback, the feedback was not as short and simple as we expected.

We hypothesized that after this calibration, students' average score on related exam questions would be higher. We compared student performance on CRC-card-related questions on the exams of this semester (with calibration as training) and the previous semester (without training). However, the students' average grade on those questions was 85.3% this semester and 85.4% the previous semester; we did not find any significant change. Upon seeing these results, we surmised that the calibration assignment was done several weeks before the next exam, and, without follow-up practice, students forgot the training they had received.

5. WHAT SAMPLE ARTIFACTS SHOULD WE USE FOR CALIBRATION?
After students finish the calibration, the instructor can see the calibration report for each artifact, as shown in Figure 2. Each table shows the students' grades on each question for a sample artifact. The green color highlights the expert grade, and the bolded number is the plurality of the students' grades.

Figure 2. A calibration report in the Expertiza system

Figure 2 shows a sample artifact on which the calibration was quite successful, with exact agreement of more than 40% and adjacent agreement of almost 80%. However, it was not immediately clear whether this was related to the quality of the artifact. When we calculated the percentages of agreement for each sample artifact, we found that the level of agreement is related to the quality of the artifact: the higher the grade a sample received, the higher the agreement students tended to achieve. This raises another question: what kinds of artifacts work better as samples in calibration?

We put the percentages of agreement and the grades for the artifacts together to examine the relationship between the agreement and the grades that the sample artifacts received. We used both the sample artifacts and the artifacts reviewed by the exemplary reviewers. The distributions and fit lines are shown in Figures 3 and 4.

Figure 3. Relationship between adjacent-agreement percentage and sample grade

Figure 4. Relationship between exact-agreement percentage and sample grade
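The per-sample analysis behind Figures 3 and 4 amounts to computing an agreement percentage for each sample artifact and fitting a line against the sample's grade. The sketch below shows one way to do that with an ordinary least-squares fit; the function names are ours, and the data values are made up for illustration rather than taken from the study.

"""Sketch of the per-sample analysis behind Figures 3 and 4: compute an
agreement percentage per sample artifact, then fit a least-squares line
against the sample grades. The numbers below are illustrative only."""

def adjacent_agreement_pct(student_score_sets, expert_scores, tolerance=1):
    """Percent of all (reviewer, criterion) ratings within ±tolerance of the expert."""
    hits = total = 0
    for scores in student_score_sets:          # one list of scores per reviewer
        for s, e in zip(scores, expert_scores):
            total += 1
            hits += abs(s - e) <= tolerance
    return 100.0 * hits / total

def least_squares_fit(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

if __name__ == "__main__":
    # Hypothetical samples: sample grade vs. per-sample adjacent agreement %.
    grades = [95, 88, 80, 72, 60]
    agreement = [92.0, 85.0, 78.0, 70.0, 55.0]
    a, b = least_squares_fit(grades, agreement)
    print(f"fit line: agreement = {a:.2f} * grade + {b:.2f}")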
We find that the samples that received higher grades usually elicited higher levels of agreement (both exact and adjacent). The lower the quality of a sample, the lower the agreement we observed between teaching staff and students.

We looked into the samples used in each assignment and found that it is usually harder for students to make the same judgment as the teaching staff on an artifact of low quality. There could be multiple reasons. The first is that the teaching staff have seen more artifacts; they know the distribution of artifact quality and can therefore make better judgments. Student reviewers may be able to tell that an artifact is of low quality based on one criterion, but they can be more critical than warranted because they have not seen even worse examples. From this perspective, it is important for instructors to include at least one or two low-quality artifacts among the samples, to show students how to rate poor work.

Another factor that may lower the agreement between teaching staff and students is the reliability of the criterion: some criteria are not specific enough for reviewers to make reliable judgments [2]. For example, the criterion "(On a Likert scale) does the author provide enough examples in this article?" is not reliable, since "enough" is not well defined. To improve review rubrics, instructors can create "advice" for each level (sometimes known as an "anchored scale"), for example, "1/5: no example provided." From this perspective, calibration can also be used to test the instructor's review rubric.

6. CONCLUSION
In this paper, we have described our experience with calibration in peer assessment in Expertiza. We first introduced two modes of calibration that have been used in online peer-assessment systems: stand-alone calibration and mixed calibration. Stand-alone calibration trains students to become better reviewers, while mixed calibration finds credible reviewers in the course of performing peer assessment. We also discussed the pedagogical scenarios for which each mode is suitable.

We calculated the agreement between the students' ratings and the teaching staff's ratings on the sample artifacts. We found that students in our assignments, on average, agreed exactly with the teaching staff on more than 40% of the ratings; that is, on more than 40% of the ratings done by students during calibration, the student gave exactly the same score as the teaching staff. In addition, more than 70% of the ratings done by students were within ±1 of the scores given by the teaching staff. To test whether students still performed as well on the actual peer assessment after training, we asked the teaching staff to identify some good reviewers in each course. Using their reviews as exemplars, we found that, in the actual peer-assessment phases, the agreement was similar to that on the calibration assignments, and sometimes even a little higher.

We compared the volume of textual feedback in the semester with calibration and the previous semester without calibration. We found that after calibration, students tended to give more extensive textual feedback, fill in more text boxes with comments, and give more constructive feedback.

We also found that the level of rating agreement between students and teaching staff is related to the quality of the artifact; namely, students tended to agree less with the teaching staff on artifacts of low quality. To improve agreement, we suggest that (1) in the calibration, an instructor use both median-quality and low-quality artifacts as samples, and (2) the instructor provide "advice" for each level of each criterion.

One future study we are interested in is calibrating the textual feedback; in this paper, we have calibrated only the numerical scores. It is possible that a student and the teaching staff both gave a 4/5 on one criterion of a sample artifact, yet did not see the same issue. This kind of agreement can only be measured by calibration of the textual feedback.
7. REFERENCES
[1] E. F. Gehringer, "A Survey of Methods for Improving Review Quality," in New Horizons in Web Based Learning, Y. Cao, T. Väljataga, J. K. T. Tang, H. Leung, and M. Laanpere, Eds. Springer International Publishing, 2014, pp. 92-97.
[2] Y. Song, Z. Hu, and E. F. Gehringer, "Closing the Circle: Use of Students' Responses for Peer-Assessment Rubric Improvement," in Advances in Web-Based Learning -- ICWL 2015, F. W. B. Li, R. Klamma, M. Laanpere, J. Zhang, B. F. Manjón, and R. W. H. Lau, Eds. Springer International Publishing, 2015, pp. 27-36.
[3] R. Robinson, "Calibrated Peer Review™," Am. Biol. Teach., vol. 63, no. 7, pp. 474-480, Sep. 2001.
[4] A. Russell, "Calibrated peer review: a writing and critical-thinking instructional tool," in Teaching Tips: Innovations in Undergraduate Science Instruction, 2004, p. 54.
[5] C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller, "Tuned Models of Peer Assessment in MOOCs," arXiv:1307.2579 [cs, stat], Jul. 2013.
[6] Y. Wang, Y. Jiang, M. Chen, and X. Hao, "E-learning-oriented incentive strategy: Taking EduPCR system as an example," World Trans. Eng. Technol. Educ., vol. 11, no. 3, pp. 174-179, Nov. 2013.
[7] E. Gehringer, "Expertiza: information management for collaborative learning," in Monit. Assess. Online Collab. Environ. Emergent Comput. Technol. E-Learn. Support, 2009, pp. 143-159.
[8] J. R. Wright, C. Thornton, and K. Leyton-Brown, "Mechanical TA: Partially Automated High-Stakes Peer Grading," in Proceedings of the 46th ACM Technical Symposium on Computer Science Education, New York, NY, USA, 2015, pp. 96-101.
[9] C. Schunn, A. Godley, and S. DeMartino, "The Reliability and Validity of Peer Review of Writing in High School AP English Classes," J. Adolesc. Adult Lit., Apr. 2016.
[10] I. H. Hsiao and F. Naveed, "Identifying learning-inductive content in programming discussion forums," in IEEE Frontiers in Education Conference (FIE), 2015, pp. 1-8.
[11] Y. Song, Z. Hu, Y. Guo, and E. Gehringer, "An Experiment with Separate Formative and Summative Rubrics in Educational Peer Assessment," submitted to IEEE Frontiers in Education Conference (FIE), 2016.
[12] J. P. Kincaid et al., "Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel," Feb. 1975.