The role of initial input in reputation systems to generate accurate aggregated grades from peer assessment

Zhewei Hu, Yang Song, Edward F. Gehringer
Department of Computer Science, North Carolina State University, Raleigh, United States
zhu6@ncsu.edu, ysong8@ncsu.edu, efg@ncsu.edu

ABSTRACT
High-quality peer assessment has many benefits. It can not only offer students a chance to learn from their peers but also help teaching staff decide on grades for student work. There are many ways to do quality control for educational peer review, such as a calibration process, a reputation system, etc. Previous research has shown that reputation systems can help to produce more accurate aggregated grades by using peer assessors' reputations as weights when computing an average score. However, for certain kinds of assignments, there still exist large gaps (more than 10 points out of 100, on average) between expert grades (grades given by more than one expert marker) and aggregated grades (grades computed from peer assessment). In order to narrow this gap and improve the accuracy of aggregated grades, we designed three experiments using different initial inputs (reputations) for reputation systems. These initial inputs came from a calibration assignment, previous review rounds, and previous assignments. Our experiments show that, under certain conditions, the accuracy of aggregated grades can be significantly improved. Furthermore, for assignments that did not achieve the desired results, our analysis suggests that the reason may be the mediocre design of review rubrics and the teaching staff's idiosyncratic grading style.

Keywords
Peer assessment; peer grading; educational peer review; reputation systems
1. INTRODUCTION
Peer assessment is commonly used in colleges, universities, and MOOCs. It can offer assessors a chance to learn from their peers and improve their understanding of the assignment requirements. Peer assessment can also help teaching staff decide on grades for student work. However, in order to make the peer assessment process credible, we need a way to distinguish good peer assessors from bad ones. One solution is to use reputation systems [1]. In a reputation system, each peer assessor has one or more reputation values. Reputation is a quantitative measure used to judge how reliable each assessor is. Several reputation algorithms have already been created to calculate reputations from peer assessment grades [2, 3, 4]. Basically, each algorithm considers one or more measurements, such as validity, reliability, spread¹, etc. The reputation can be used in different ways, such as to "… give credit to students for careful reviewing or to weight peer-assigned grades" [1]. According to previous research, reputation systems can play an important role in educational peer review systems. Moreover, using reputation algorithms to compute peer grades is indeed more effective than the naive average approach [5].

¹ Spread is a metric that measures the tendency of an assessor to assign different scores to different work. Generally speaking, a higher spread is better, because it indicates that the peer assessor can distinguish good artifacts from bad ones.

Although aggregated grades with reputations as weights can outperform naive averages, there is still much room for improvement. According to our previous research, for assignments based on writing, the average absolute bias between expert grades and aggregated grades was less than 4 points out of 100 most of the time. In this case, when teaching staff give expert grades, they can use the aggregated grades generated by reputation systems as references [5], since aggregated grades are available immediately after the peer assessment stage finishes, which is prior to the expert grading stage. The availability of aggregated grades can give teaching staff a general idea of the quality of each artifact from the assessors' point of view and help them decide the expert grades. If we can narrow the gap even further, we may be able to dispense with expert grading of writing and use aggregated grades instead, although we believe spot-checking is still necessary. However, for assignments based on both writing and programming, there are still large gaps between aggregated grades and expert grades (naive average bias larger than 10 points out of 100) even after applying reputation systems.

Hence, it is necessary to improve the accuracy of aggregated grades and reduce the burden on teaching staff. We designed three experiments and tried to answer these three questions:
1. Can reputation be taken from a formative review round to a summative review round?
2. Can reputation be taken from calibration to real assignments?
3. Can reputation be taken from one assignment to a different one?
In each experiment, we used a new set of data as the initial input for the reputation system. In later parts of this paper, we describe the experimental design in detail and analyze the results of the experiments.

2. REPUTATION SYSTEMS
In this paper, we focus on the performance of Hamer's and Lauw's algorithms [2, 3], since they are both iteration-based and comparable to each other.

To compute the reputation of each peer assessor, Hamer's algorithm first assigns the same weight, 1, to all of the assessors [2]. In each iteration, the algorithm calculates a weighted average for each artifact based on the peer assessors' reputations. Then it computes the difference between the aggregated grade of each artifact and each peer assessment grade. The larger this difference is, the more inconsistent the peer assessor is compared with the others. After that, the algorithm updates the reputation of each peer assessor accordingly and calculates the aggregated grade for each artifact again, until the grades converge.

The steps of Lauw's algorithm are similar to Hamer's, but Lauw's algorithm applies different arithmetic formulas to calculate the differences and to scale the final results.
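The Python sketch below mirrors the iterative structure just described (equal initial weights, reputation-weighted averaging, a squared-deviation penalty, iteration until convergence). The update and scaling formulas here are simplified assumptions for illustration, not the published formulas of Hamer's or Lauw's algorithm; the init parameter is what the experiments in this paper vary.

from typing import Dict, Optional

# grades[assessor][artifact] = score the assessor gave to the artifact (0-100 scale).
Grades = Dict[str, Dict[str, float]]


def hamer_style_reputation(grades: Grades,
                           init: Optional[Dict[str, float]] = None,
                           max_iter: int = 100,
                           tol: float = 1e-6):
    """Iterative reputation/aggregation scheme in the spirit of Hamer's algorithm [2].

    Simplified stand-in, not the published formulas: reputations are set inversely
    proportional to each assessor's mean squared deviation from the current
    aggregated grades and rescaled so that the mean reputation is 1. Assessors
    missing from `init` default to a reputation of 1.
    """
    assessors = list(grades)
    artifacts = sorted({art for scores in grades.values() for art in scores})
    # Step 1: every assessor starts at 1 unless a different initial input is given.
    rep = {r: (init or {}).get(r, 1.0) for r in assessors}

    aggregated: Dict[str, float] = {}
    for _ in range(max_iter):
        # Step 2: reputation-weighted average grade for each artifact.
        new_agg = {}
        for art in artifacts:
            pairs = [(rep[r], grades[r][art]) for r in assessors if art in grades[r]]
            total_weight = sum(w for w, _ in pairs)
            new_agg[art] = sum(w * s for w, s in pairs) / total_weight
        # Step 3: assessors who deviate more from the aggregate lose reputation
        # (squared differences, as mentioned in Section 3.2 of the paper).
        new_rep = {}
        for r in assessors:
            sq_devs = [(grades[r][art] - new_agg[art]) ** 2 for art in grades[r]]
            new_rep[r] = 1.0 / (sum(sq_devs) / len(sq_devs) + 1e-9)
        mean_rep = sum(new_rep.values()) / len(new_rep)
        new_rep = {r: v / mean_rep for r, v in new_rep.items()}
        # Step 4: stop once the aggregated grades no longer change.
        converged = aggregated and all(abs(new_agg[a] - aggregated[a]) < tol for a in artifacts)
        aggregated, rep = new_agg, new_rep
        if converged:
            break
    return aggregated, rep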
3. DATA COLLECTION AND EXPERIMENTAL DESIGN
In this section, we provide an overview of the experimental design and validate the dataset we collected for the later experiments.

3.1 Class Setting
We collected our data from two courses (CSC 517, Fall 2015 and Spring 2016, at NC State University) in Expertiza, a web-based educational peer review system [6]. In CSC 517, Fall 2015, 92 students enrolled in the course (98.9% graduate level, 17.4% female). These students come from different countries, with a predominance from India (78.3% India, 9.8% China, 9.8% United States, 2.1% other countries). Moreover, most students major in Computer Science (88.9% Computer Science, 3.7% Computer Engineering, 3.7% Electrical Engineering, 3.7% Computer Networking). In CSC 517, Spring 2016, 54 students enrolled in the course (94.4% graduate level, 13% female). The majority of these students also major in Computer Science (79.3% Computer Science, 9.8% Computer Engineering, 7.6% Electrical Engineering, 3.3% Computer Networking), and they come from different countries (74.1% India, 14.8% China, 9.3% United States, 1.8% other countries).

Each course contains four assignments, all of which are graded on a scale of 0 to 100: a Wikipedia contribution (writing a Wikipedia entry on a given topic), Program 1 (building an information management system with the Ruby on Rails web application framework), the OSS project (typically, refactoring an open-source software model), and the final project (adding new features to an open-source software project).

For each assignment, students have to write at least two peer assessments by completing different kinds of review rubric questions. Furthermore, they can do more peer assessments for extra credit. Several policies are in place to keep students from gaming the system. In summary, each Wikipedia contribution artifact received 9 peer assessments on average, each Program 1 artifact received 15, each OSS project artifact received 13, and each final project artifact was evaluated by 20 assessors on average.

In Expertiza, there are three main types of review rubric questions: choice question, text response, and file upload. The choice question has two subtypes, scored and unscored. Criterion is a scored question; conversely, dropdown, multiple-choice, and checkbox are unscored questions. It is worth noting that in Expertiza, only scored questions are included in peer assessment grades. As a scored question type, criterion is the combination of a dropdown and a text area, which means that peer assessors can not only give a score for a certain question but also write text comments. Criterion is one of the most frequently used question types in Expertiza.

Past research shows that the performance of Hamer's and Lauw's algorithms varies a lot across different kinds of assignments [5]. Hence, we tried to consider different assignment categories in our experimental design. According to the type of submission, we classified five assignments (including one for calibration use, which will be mentioned in Section 4) into three categories: writing assignments, programming assignments, and assignments combining writing with programming. We classified the Wikipedia calibration assignment as a writing assignment because it helped students to improve their peer assessment skills on a writing assignment. The Wikipedia contribution is also classified as a writing assignment; Program 1 is a programming assignment; the OSS project is considered an assignment combining writing with programming. Although the final project contains both a writing section and a programming section, we consider it a writing assignment here, because students peer-assessed only their peers' design documents due to the shortage of time.

3.2 Data Verification
We ran Hamer's and Lauw's algorithms on our dataset and checked whether aggregated grades with reputations as weights were more accurate than naive averages. Table I shows the comparison between aggregated grades and naive averages. Two metrics are used to measure the accuracy of aggregated grades, namely, average absolute bias and root mean square error (RMSE). Average absolute bias indicates the average distance between the aggregated grade and the expert grade; RMSE is another frequently used measurement of the differences between values. Lower average absolute bias and RMSE mean better performance.
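Written out, with a_k the aggregated (or naive-average) grade of artifact k, e_k its expert grade, and n the number of artifacts in an assignment, the two metrics are:

\text{Avg. abs. bias} = \frac{1}{n}\sum_{k=1}^{n}\lvert a_k - e_k\rvert,
\qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\bigl(a_k - e_k\bigr)^{2}}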
Table I shows that aggregated grades calculated by the reputation algorithms perform better than naive averages for six out of the eight assignments. For all six of these assignments, Hamer's algorithm outperforms Lauw's algorithm, so we used only Hamer's algorithm in the later experiments. The two assignments that violate our expectation are the Wikipedia contribution, Fall 2015 and Program 1, Spring 2016.

For the Wikipedia contribution, Fall 2015, the aggregated grades produced by Hamer's algorithm show less validity than the naive averages when compared with the expert grades. One potential reason is that Hamer's algorithm squares the differences between peer assessment grades and aggregated grades during each iteration, which gives the assessors' reputations a larger variance and degrades the performance of the algorithm. For Program 1, Spring 2016, it is very likely that the instructor-defined test cases in this semester were not as elaborate as those in the previous semester (the average absolute bias of Program 1 in Fall 2015 is much lower than that in Spring 2016).

Table I. Comparison of differences among aggregated grades from Hamer's and Lauw's algorithms and naive averages

Assgt. name                       Metric          Hamer's alg.   Lauw's alg.   Naive average
Wikipedia contrib., Fall 2015     Avg. abs. bias  4.62           3.49          3.50
                                  RMSE            6.13           4.71          4.72
Wikipedia contrib., Spring 2016   Avg. abs. bias  2.91           3.15          3.17
                                  RMSE            3.61           3.89          3.94
Prog. 1, Fall 2015                Avg. abs. bias  4.32           5.58          6.21
                                  RMSE            5.84           7.59          8.19
Prog. 1, Spring 2016              Avg. abs. bias  11.46          10.77         10.59
                                  RMSE            13.06          12.46         12.36
OSS project, Fall 2015            Avg. abs. bias  5.30           6.55          7.29
                                  RMSE            6.49           7.46          8.06
OSS project, Spring 2016          Avg. abs. bias  5.22           6.90          7.00
                                  RMSE            6.12           8.47          8.57
Final project, Fall 2015          Avg. abs. bias  4.64           5.93          6.27
                                  RMSE            7.52           8.70          9.03
Final project, Spring 2016        Avg. abs. bias  4.65           5.91          6.07
                                  RMSE            5.91           7.48          7.61

An instructor-defined test case is much like a test case in software engineering: a set of conditions used to check whether an application is working as it was originally designed [7]. The purpose of instructor-defined test cases is to help students understand the requirements of a certain assignment and also to help teaching staff grade the students' artifacts. For instance, "Can an admin delete other admins other than himself and the preconfigured admin?" is an instructor-defined test case. This question was used in both the review rubric and the expert grading stage. By manually testing a series of instructor-defined test cases, teaching staff and peer assessors are able to decide the grades.
Instructor-defined test cases are used heavily in Program 1. Because Program 1 is not a topic-based assignment and all students are required to build web applications with the same functionality, it is easier for the instructor to create such test cases than for assignments with different topics.

Hamer's algorithm is iteration-based, which means that the algorithm takes several iterations before a solution (fixed point) is reached. However, the results generated by these two algorithms can be locally optimal solutions instead of a globally optimal solution [2]; that is, the result of each algorithm can be optimal within a neighboring set of candidate solutions rather than among all possible solutions [8]. One reason is that the initial reputation assigned to each peer assessor is always equal to 1, which forces every peer assessor's ability to be the same at the very beginning.

In order to verify that different initial inputs lead to different fixed points, we assembled a very small set of peer assessment records, shown in Table II. Four peer assessors (a, b, c, d) assessed four artifacts (1, 2, 3, 4). To make the dataset more similar to a real scenario, we assumed that assessor b did not assess artifact 3.

We used two sets of data as initial inputs for Hamer's algorithm, whose reputation range is [0, ∞). The first set of initial input is the same as the default setting of Hamer's algorithm, 1 for all assessors; the second set is arbitrarily chosen, that is, 0.2 for assessor 3 and 1 for the rest. The final reputations are shown in Table III. As the table shows, we obtained two totally different sets of results, so different initial inputs clearly affect the final results.

Table II. Scores assigned by four peer assessors to four artifacts

            Assessor a   Assessor b   Assessor c   Assessor d
Artifact 1  10           9            10           8
Artifact 2  7            6            8            7
Artifact 3  7            -            2            4
Artifact 4  6            7            3            3

Table III. Reputations with initial input equal to 1 and with other values

Assessor   Rep. values with init. rep. all eq. to 1   Rep. values with init. rep. not all eq. to 1
1          0.50                                       2.66
2          0.77                                       2.67
3          2.00                                       0.42
4          2.59                                       0.79

This test shows that there are different fixed points for this dataset. If we use 1 as the initial input, it will often lead to a "reasonable" fixed point, but not always [2]. Instead, if we have prior knowledge about which assessors might be credible, we should make use of this prior knowledge, and the algorithm may converge to a more reasonable fixed point accordingly.

Thus, we tried to use different initial inputs to obtain more accurate aggregated grades. Instead of assigning an arbitrary initial reputation to each student, drawing on other available data, such as reputations from another review round, calibration results [9], or reputations from former assignments, can be a better way to achieve more reasonable results.
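In terms of the sketch from Section 2, the two runs on the Table II data differ only in the init argument. The values this simplified sketch produces will not match Table III, which was computed with Hamer's actual formulas; assessor c is used below as the assessor whose initial reputation is lowered to 0.2, which is an assumption about how Table II's labels map to Table III's rows.

# Peer assessment grades from Table II (assessor b did not assess artifact 3).
table_ii = {
    "a": {"1": 10, "2": 7, "3": 7, "4": 6},
    "b": {"1": 9,  "2": 6,         "4": 7},
    "c": {"1": 10, "2": 8, "3": 2, "4": 3},
    "d": {"1": 8,  "2": 7, "3": 4, "4": 3},
}

# Run 1: the default initial input, reputation 1 for every assessor.
agg_default, rep_default = hamer_style_reputation(table_ii)

# Run 2: one assessor starts at 0.2, the rest at 1 (the second setting in Section 3.2).
agg_alt, rep_alt = hamer_style_reputation(
    table_ii, init={"a": 1.0, "b": 1.0, "c": 0.2, "d": 1.0})

print(rep_default, rep_alt)  # the two runs need not end at the same fixed point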
3.3 Research Questions
This part presents three research questions. By answering them, we can figure out whether replacing the initial input of Hamer's algorithm, from 1 to other data available in the same course, will produce more accurate results.

3.3.1 Can reputation be taken from a formative review round to a summative review round?
Since Fall 2015, Expertiza has allowed different rubrics to be used in each round of review. For each assignment with this feature, students were encouraged to finish two rounds of peer assessment: a formative review round and a summative review round. During the formative review round, the teaching staff presented an elaborate formative rubric to peer assessors. Two questions asked in the formative rubric are presented below. "Rate how logical and clear the organization is. Point out any places where you think that the organization of this article needs to be improved." "List any related terms or concepts for which the writer failed to give adequate citations and links. Rate the helpfulness of the citations." The purpose of these questions is to encourage peer assessors to look into the artifact, point out problems, and offer insightful suggestions [10]. After an assessor submitted formative feedback, Expertiza calculated the assessment grade based on the scored questions in the formative rubric.

After that, authors have a chance to modify their work according to the information given by their peers. In the summative review round, the teaching staff offered a summative rubric designed to guide peer assessors in evaluating the overall quality of the artifacts and checking whether the authors made the changes suggested in the formative review round. Below are two questions used in the summative rubric. "Coverage: does the artifact cover all the important aspects that readers need to know about this topic? Are all the aspects discussed at about the same level of detail?" "Clarity: Are the sentences clear and non-duplicative? Is the language used in this artifact simple and basic enough to be understood?" After assessors submitted their summative feedback, Expertiza calculated the assessment grades again for each artifact that received new feedback.

We hypothesized that the assessment credibility of the same assessor in the formative and summative review rounds is related, and that reputations calculated from the formative review round, if used as the initial input of the summative review round, can produce more accurate aggregated grades.

3.3.2 Can reputation be taken from calibration to real assignments?
At the beginning of the Spring 2016 semester, we created a Wikipedia calibration assignment before the real assignments. The instructor selected several representative artifacts from former semesters; these artifacts had major differences in quality. The instructor then submitted an expert peer assessment for each artifact, based on the same review rubric that students would use. During class, students assessed those artifacts in Expertiza. After that, Expertiza generated a report for both the instructor and the students. Based on the report, the instructor analyzed the results and helped students enhance their peer assessment skills. We hypothesized that the assessment credibility of the same assessor on the calibration assignment and on the real assignment is related, and that reputations calculated from the calibration assignment, if used as the initial input of the subsequent real assignment, can produce more accurate aggregated grades.

3.3.3 Can reputation be taken from one assignment to a different one?
In our dataset, there are four real assignments in a fixed order in each semester. We hypothesized that the assessment credibility of the same assessor on one assignment and on a subsequent one is related, and that reputations calculated from one assignment, if used as the initial input of the subsequent assignment, can produce more accurate aggregated grades. What is more, since we have already classified all assignments into three categories, we also assumed that using reputations from one assignment as the initial input of a subsequent assignment of the same category can produce even more accurate aggregated grades.
4. EXPERIMENTS AND ANALYSIS
For each of the three questions listed in the last section, we did a corresponding experiment to verify the derived hypothesis. Since Hamer's algorithm outperformed Lauw's algorithm for six out of eight assignments in the data verification section, we display only the reputation results from Hamer's algorithm in the experiments.

4.1 Can reputation be taken from a formative review round to a summative review round?
Table IV shows the differences among aggregated grades, naive averages, and expert grades for the assignments in CSC 517, Spring 2016, using the two metrics (average absolute bias and RMSE). We chose CSC 517, Spring 2016 because all assignments in this course support two rounds of peer assessment.

We found that using reputation results from the formative review round as the initial input does not work well for all assignments. Among the four assignments, two (Program 1, Spring 2016 and the final project, Spring 2016) saw improvement with this method. One (the Wikipedia contribution, Spring 2016) converged to the same fixed point as using 1 as the initial input; since Hamer's algorithm is iteration-based, it is possible for different initial inputs to converge to the same fixed point. The last one (the OSS project, Spring 2016) fared even worse with the alternative initial input. Overall, we were not able to conclude whether initial input from the formative review round is a good input option for Hamer's algorithm.

One potential reason is that, according to the first author's master's thesis, peer assessment records from the formative review round generate less accurate aggregated grades than peer assessment records from the summative review round. During the formative review round, peer assessors were encouraged to offer suggestions, and authors might make changes before the summative review round. Hence, it is possible that peer assessments based on the initial version of the products were not accurate. During the summative review round, however, the artifacts were unchangeable and were the same version as the one the teaching staff graded. Therefore, it is reasonable that, compared with expert grades, peer assessments made during the summative review round could have higher validity than those made during the formative review round, and that initial input from the formative review round cannot help to improve the accuracy of the aggregated grades.

Table IV. Comparison of differences between aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from the formative review round

Assgt. name                           Metric          Initial input equal to 1   Initial input from formative review round   Naive average
Wikipedia contribution, Spring 2016   Avg. abs. bias  2.91                       2.91                                        3.17
                                      RMSE            3.61                       3.61                                        3.94
Program 1, Spring 2016                Avg. abs. bias  11.46                      11.35                                       10.59
                                      RMSE            13.06                      12.96                                       12.36
OSS project, Spring 2016              Avg. abs. bias  5.22                       5.29                                        7.00
                                      RMSE            6.12                       6.16                                        8.57
Final project, Spring 2016            Avg. abs. bias  4.65                       4.54                                        6.07
                                      RMSE            5.91                       5.77                                        7.61

4.2 Can reputation be taken from calibration to real assignments?
In this experiment, we further tested whether the aggregated grades can be improved by using calibration results as the initial input. Since we trialed a calibration assignment only for the Wikipedia contribution in the Spring 2016 semester, we based the experiment for this hypothesis on data from the Wikipedia calibration, Spring 2016 and the Wikipedia contribution, Spring 2016.

There was only one round of peer assessment in the Wikipedia calibration, Spring 2016, and we used the same formative review rubric as in the Wikipedia contribution, Spring 2016. After the assessors submitted their feedback, Expertiza computed the assessment grades for the representative artifacts. Then the instructor submitted expert peer assessments based on the same formative review rubric. After that, we calculated each assessor's reputation value based on their assessment grades and the expert grades. When the Wikipedia contribution, Spring 2016 finished, we used the reputation values produced from the calibration assignment as the initial input to compute a new set of reputation values, and compared this new set with the reputation values calculated from an initial reputation equal to 1.
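The paper does not spell out how agreement with the expert assessment is turned into a reputation value, so the following is only an assumed illustration: the closer an assessor's calibration grades are to the expert grades, the higher the initial reputation fed into Hamer's algorithm for the real assignment.

def calibration_reputation(calibration_grades: Grades,
                           expert_grades: Dict[str, float]) -> Dict[str, float]:
    """Assumed illustration, not the paper's formula: an assessor's initial reputation
    is inversely related to the mean squared deviation of their calibration assessments
    from the instructor's expert grades, rescaled so that the mean reputation is 1."""
    raw = {}
    for assessor, scores in calibration_grades.items():
        sq_devs = [(scores[art] - expert_grades[art]) ** 2 for art in scores]
        raw[assessor] = 1.0 / (sum(sq_devs) / len(sq_devs) + 1e-9)
    mean_raw = sum(raw.values()) / len(raw)
    return {assessor: value / mean_raw for assessor, value in raw.items()}

# The resulting values then serve as the initial input for the real assignment, e.g.:
# aggregated, reputations = hamer_style_reputation(wiki_contribution_grades,
#                                                  init=calibration_reputation(calib, expert))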
Table V shows that both average absolute bias and RMSE decrease when Hamer's algorithm is run with the calibration results as the initial input. However, the data used in this experiment are quite limited. If we want to further verify the efficacy of the calibration process, more data and more experiments are needed.

Table V. Comparison of differences between naive averages and aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from the calibration assignment

Wikipedia contribution, Spring 2016
Different sets of aggregated grades                  Avg. abs. bias   RMSE
Hamer's alg. with initial input equal to 1           2.91             3.61
Hamer's alg. with initial input from calibration     2.80             3.51
Naive averages                                       3.17             3.94

4.3 Can reputation be taken from one assignment to a different one?
In this experiment, we tested the hypothesis that using reputations from former assignments as the initial input will produce more accurate aggregated grades. Both CSC 517, Fall 2015 and CSC 517, Spring 2016 include the Wikipedia contribution, Program 1, the OSS project, and the final project, and these four assignments occur in a fixed order. To verify this hypothesis, we designed three sub-experiments, run separately on the two courses. The first sub-experiment is based on the Wikipedia contribution and Program 1: we used initial input from the Wikipedia contribution assignment and peer assessment records from Program 1 to compute the aggregated grades, and compared these results with aggregated grades based on an initial input equal to 1. The second sub-experiment was between Program 1 and the OSS project: following the same process as in the first sub-experiment, we produced aggregated grades with initial input from Program 1 and peer assessment records from the OSS project, and compared them with the grades calculated with the initial reputation equal to 1. The third sub-experiment was between the OSS project and the final project. The results are shown in Table VI.

Table VI shows that in Fall 2015, aggregated grades with initial input from former assignments have higher validity than the grades produced with an initial input equal to 1. For Spring 2016, among the three sub-experiments, one (between the Wikipedia contribution, Spring 2016 and Program 1, Spring 2016) converged to the same fixed point as using 1 as the initial input, and another (between Program 1, Spring 2016 and the OSS project, Spring 2016) became worse when initial input from the former assignment was used. The last sub-experiment (between the OSS project, Spring 2016 and the final project, Spring 2016) supported our hypothesis. In general, initial input from former assignments obtained equal or better results than an initial input equal to 1 in five out of six experiments. Therefore, we believe that initial input from former assignments can increase the accuracy of aggregated grades.

Table VI. Comparison of differences between aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from former assignments

Transfer / assignment           Metric          Initial input from former assgt.   Initial input equal to 1   Naive average
Wiki → Prog 1, Fall 2015        Avg. abs. bias  4.13                               4.32                       6.21
                                RMSE            5.76                               5.84                       8.19
Wiki → Prog 1, Spring 2016      Avg. abs. bias  11.46                              11.46                      10.59
                                RMSE            13.06                              13.06                      12.36
Prog 1 → OSS, Fall 2015         Avg. abs. bias  5.08                               5.30                       7.29
                                RMSE            6.31                               6.49                       8.06
Prog 1 → OSS, Spring 2016       Avg. abs. bias  5.41                               5.22                       7.00
                                RMSE            6.36                               6.12                       8.57
OSS → Final, Fall 2015          Avg. abs. bias  4.52                               4.64                       6.27
                                RMSE            7.46                               7.52                       9.03
OSS → Final, Spring 2016        Avg. abs. bias  4.55                               4.65                       6.07
                                RMSE            5.81                               5.91                       7.61
Although the new method introduced above (initial input from former assignments) performs better in most cases, the improvement is limited; most of the time, it improved the average absolute bias by less than 0.5 points. What is more, one sub-experiment (between Program 1, Spring 2016 and the OSS project, Spring 2016) obtained even worse results with this new method. One potential explanation is that Program 1 is a programming assignment, but the OSS project combines a writing section with a programming section. During the grading of the OSS project, the teaching staff gave scores to the writing section and the programming section separately. These two scores are related, but not always directly proportional to each other. If one team did a good job on programming but wrote the writing section perfunctorily, they would get a low score on the writing section regardless of their high score on programming. However, if another team did not do the programming section well, they would usually not receive a high score on the writing section either. Both the writing and programming scores are on a scale of 0 to 100, and the final score of the OSS project is the combination of these two scores with corresponding weights defined by the teaching staff.

In order to verify the effect of assignment categories and to obtain more improvement, we designed a new experiment that used initial input from both the Wikipedia contribution and Program 1, acting on the peer assessment records from the OSS project. That is to say, the initial input for the OSS project writing section came from the Wikipedia contribution and the initial input for the OSS project programming section came from Program 1. Furthermore, we combined the aggregated grades of the writing section and the programming section with the same weights used for producing the final expert grades of the OSS project (see the sketch after Table VII). Table VII presents the experiment results for both Fall 2015 and Spring 2016. Compared with the results produced from an initial input equal to 1, the average absolute bias decreases by more than 1.3 points on average with this new method. This is a big improvement, which indicates that assignment categories should be taken into consideration in future work.

Table VII. Comparison of differences among aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from the Wikipedia contribution and Program 1

OSS project, Fall 2015
Different sets of aggregated grades                                            Avg. abs. bias   RMSE
Hamer's alg. with initial input equal to 1                                     5.30             6.49
Hamer's alg. with initial input of the writing section from the Wikipedia
  contribution and of the programming section from Program 1                   3.32             4.44
Naive averages                                                                 7.29             8.01

OSS project, Spring 2016
Different sets of aggregated grades                                            Avg. abs. bias   RMSE
Hamer's alg. with initial input equal to 1                                     5.22             6.12
Hamer's alg. with initial input of the writing section from the Wikipedia
  contribution and of the programming section from Program 1                   4.45             5.12
Naive averages                                                                 7.00             8.57
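A minimal sketch of the per-section setup described above, reusing hamer_style_reputation and the Grades alias from Section 2. The 0.5/0.5 weights are placeholders; the actual weights were defined by the teaching staff and are not given in the paper.

def combine_oss_sections(writing_grades: Grades,
                         programming_grades: Grades,
                         rep_from_wiki: Dict[str, float],
                         rep_from_prog1: Dict[str, float],
                         w_writing: float = 0.5,
                         w_programming: float = 0.5) -> Dict[str, float]:
    """Aggregate each OSS-project section with the reputation carried over from the
    earlier assignment of the same category, then recombine the two section grades
    with the same weights used for the expert grades (0.5/0.5 here is a placeholder)."""
    agg_writing, _ = hamer_style_reputation(writing_grades, init=rep_from_wiki)
    agg_programming, _ = hamer_style_reputation(programming_grades, init=rep_from_prog1)
    return {team: w_writing * agg_writing[team] + w_programming * agg_programming[team]
            for team in agg_writing}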
5. DISCUSSION
After the three experiments, we found that the average absolute biases of some assignments are still high even with our new methods, which means that there still exist obvious differences between expert grades and aggregated grades. There must be other issues that also affect the aggregated grades and are not considered by these algorithms, such as mediocrely designed rubrics, insufficient peer-review training, etc.

We then determined that the mediocre design of review rubrics and the teaching staff's idiosyncratic grading style may help to explain these high biases. For instance, the OSS project, Fall 2015 is an assignment with both a formative review round and a summative review round. Its summative rubric has seven questions. Each question in this rubric has the same weight, and Expertiza uses the naive average as the final grade, which means that each question affects more than 14% of the final grade.

One OSS artifact got 91 for the expert grade but only about 75 for the aggregated grade. The final comment given by the teaching staff was:

"Well, from the video they did the thing we expect them to do, but their tests are failing, and they should have fixed them."

In the summative rubric, there is a test-related question:

"IF it is an Expertiza project, check the pull request. Did the build pass in Travis CI? Was there any conflict that must be resolved? You can check those on the pull request on GitHub. Ignore this question if it is not an Expertiza project."

According to the 13 valid peer-review records, most peer reviewers were able to spot this problem. The average score for this question is 2.16 out of 5, which means that on average more than 8 points were deducted from the total score because the code did not pass Travis CI.

Moreover, during grading, the teaching staff almost did not consider another question in this rubric:

"Check the commits. Was new code committed during the 2nd round?"

Since this team did not commit new code, or did not commit promptly, the average score for this question is 3.58 out of 5, which means that on average more than 4 points were taken off the total score. These two questions alone deducted more than 12 points from the total score.
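As a rough arithmetic check of these figures, each of the seven equally weighted questions is worth 100/7 ≈ 14.3 points of the final grade, so

\Bigl(1 - \tfrac{2.16}{5}\Bigr) \times \tfrac{100}{7} \approx 8.1 \text{ points},
\qquad
\Bigl(1 - \tfrac{3.58}{5}\Bigr) \times \tfrac{100}{7} \approx 4.1 \text{ points},

for a combined deduction of roughly 12.2 points, consistent with the "more than 12 points" above.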
What is more, only 3 out of the 31 artifacts got grades lower than 90, and this one got 91. Evidently the teaching staff also considered it not a very successful artifact; nevertheless, a relatively tolerant grade was still assigned to this team. This can be the reason why there are large differences between expert grades and aggregated grades. A new grading method or a newly designed rubric may help to solve this problem.

6. CONCLUSIONS
In this paper, we propose several novel methods to improve the accuracy of aggregated grades generated by reputation algorithms. Since Hamer's and Lauw's algorithms are iteration-based, we tried different sets of initial inputs in order to get aggregated grades with the least bias.

We designed three experiments. The first used reputations from the formative review round as the initial input for the summative-review-round peer assessment records. Compared with an initial input equal to 1, this method did not help us obtain aggregated grades with higher accuracy, because after the formative peer assessment stage, authors have a chance to modify their work, which makes the peer assessments from the formative review round less accurate.

The second experiment used the reputations from the calibration assignment as the initial input. The results show that this method can help us get more accurate aggregated grades. However, many questions need to be answered to further verify the efficacy of the calibration process. For example, when should we have assessors perform the calibration process, at the beginning of the course or before each real peer assessment stage? How many calibration processes do we need, just one, or one for each assignment category? What content should be included in calibration? By answering these questions, we can gain a deeper understanding of the calibration process and help improve the quality of peer assessment.

The last experiment focused on initial input taken from former assignments. The results supported our hypothesis that aggregated grades calculated in this way can outperform naive averages. We also verified that, under certain circumstances, taking assignment categories into account improves the accuracy of aggregated grades considerably.

Our new methods can help to improve performance, but the average absolute biases of some assignments are still high. After looking into this, we find that there is still room to improve our review rubrics to resolve ambiguity and to provide more guidance to students (e.g., training or calibration). What is more, both Hamer's and Lauw's algorithms are rating-based. Some other educational peer review systems, such as Critviz² and Mobius SLIP³, measure the quality of peer assessments based on ranking. A different set of results might be found if we used ranking-based algorithms. We hope to address these issues, use different kinds of algorithms, and obtain aggregated grades with even higher accuracy in the future.

² https://critviz.com/
³ http://www.mobiusslip.com/

7. ACKNOWLEDGMENTS
The Peerlogic project is funded by the National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, 32614, and 1431975.
8. REFERENCES
[1] E. F. Gehringer, "A Survey of Methods for Improving Review Quality," in New Horizons in Web Based Learning, Y. Cao, T. Väljataga, J. K. T. Tang, H. Leung, and M. Laanpere, Eds. Springer International Publishing, 2014, pp. 92–97.
[2] J. Hamer, K. T. K. Ma, and H. H. F. Kwong, "A Method of Automatic Grade Calibration in Peer Assessment," in Conferences in Research and Practice in Information Technology, Australian Computer Society, 2005, pp. 67–72.
[3] H. Lauw, E. Lim, and K. Wang, "Summarizing Review Scores of 'Unequal' Reviewers," in Proceedings of the 2007 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2007, pp. 539–544.
[4] K. Cho, C. D. Schunn, and R. W. Wilson, "Validity and Reliability of Scaffolded Peer Assessment of Writing From Instructor and Student Perspectives," J. Educ. Psychol., vol. 98, no. 4, pp. 891–901, 2006.
[5] Y. Song, Z. Hu, and E. F. Gehringer, "Pluggable reputation systems for peer review: A web-service approach," in IEEE Frontiers in Education Conference (FIE), 2015, pp. 1–5.
[6] E. F. Gehringer, L. M. Ehresman, S. G. Conger, and P. A. Wagle, "Work in Progress: Reusable Learning Objects Through Peer Review: The Expertiza Approach," in Proceedings, Frontiers in Education, 36th Annual Conference, 2006, pp. 1–2.
[7] Wikipedia. (2016). Test case. [Online]. Available: https://en.wikipedia.org/wiki/Test_case
[8] Wikipedia. (2016). Local optimum. [Online]. Available: https://en.wikipedia.org/wiki/Local_optimum
[9] Y. Song, E. F. Gehringer, J. Morris, J. Kid, and S. Ringleb, "Toward Better Training in Peer Assessment: Does Calibration Help?," presented at the CSPRED workshop, EDM 2016, 2016.
[10] S. Yang, Z. Hu, Y. Guo, and E. F. Gehringer, "An Experiment with Separate Formative and Summative Rubrics in Educational Peer Assessment," in IEEE Frontiers in Education Conference (FIE), 2016.