The role of initial input in reputation systems to generate accurate aggregated grades from peer assessment

Zhewei Hu, Yang Song, Edward F. Gehringer
Department of Computer Science, North Carolina State University, Raleigh, United States
zhu6@ncsu.edu, ysong8@ncsu.edu, efg@ncsu.edu

ABSTRACT
High-quality peer assessment has many benefits. It can not only offer students a chance to learn from their peers but also help teaching staff decide on grades for student work. There are many ways to do quality control for educational peer review, such as a calibration process, a reputation system, etc. Previous research has shown that reputation systems can help to produce more accurate aggregated grades by using peer assessors' reputations as weights when computing an average score. However, for certain kinds of assignments, there still exist large gaps (more than 10 points out of 100, on average) between expert grades (grades given by more than one expert marker) and aggregated grades (grades computed from peer assessment). In order to narrow this gap and improve the accuracy of aggregated grades, we designed three experiments using different initial inputs (reputations) for reputation systems. These initial inputs came from a calibration assignment, previous review rounds, and previous assignments. Our experiments show that, under certain conditions, the accuracy of aggregated grades can be significantly improved. Furthermore, for assignments that did not achieve the desired results, our analysis suggests that the reason may be the mediocre design of review rubrics and the teaching staff's idiosyncratic grading style.

Keywords
Peer assessment; peer grading; educational peer review; reputation systems
1. INTRODUCTION
Peer assessment is commonly used in colleges, universities, and MOOCs. It can offer assessors a chance to learn from their peers and improve their understanding of the assignment requirements. Peer assessment can also help teaching staff decide on grades for student work. However, in order to make the peer assessment process credible, we need a way to distinguish good peer assessors from bad ones. One solution is to use reputation systems [1]. In a reputation system, each peer assessor has one or more reputation values. Reputation is a quantitative measure used to judge how reliable each assessor is. Several reputation algorithms have already been created to calculate reputations from peer assessment grades [2, 3, 4]. Basically, each algorithm considers one or more measurements, such as validity, reliability, spread¹, etc. The reputation can be used in different ways, such as to "… give credit to students for careful reviewing or to weight peer-assigned grades" [1]. According to previous research, reputation systems can play an important role in educational peer review systems. Moreover, using reputation algorithms to compute peer grades is indeed more effective than the naive average approach [5].

¹ Spread is a metric that measures the tendency of an assessor to assign different scores to different work. Generally speaking, a higher spread is better, because it indicates that the peer assessor can distinguish good artifacts from bad ones.

Although aggregated grades with reputations as weights can outperform naive averages, there is still much room for improvement. According to our previous research, for assignments based on writing, the average absolute bias between expert grades and aggregated grades was less than 4 points out of 100 most of the time. In this case, when teaching staff give expert grades, they can use the aggregated grades generated by reputation systems as references [5], since aggregated grades are available immediately after the peer assessment stage finishes, which is prior to the expert grading stage. The availability of aggregated grades can give teaching staff a general idea of the quality of each artifact from the assessors' point of view and help them decide the expert grades. If we can narrow the gap even further, we may be able to dispense with expert grading of writing and use aggregated grades instead, although we believe spot-checking is still necessary. However, for assignments based on both writing and programming, there are still large gaps between aggregated grades and expert grades (naive average bias larger than 10 points out of 100) even after applying reputation systems.

Hence, it is necessary to improve the accuracy of aggregated grades and reduce the burden on teaching staff. We designed three experiments and tried to answer these three questions:
1. Can reputation be taken from a formative review round to a summative review round?
2. Can reputation be taken from calibration to real assignments?
3. Can reputation be taken from one assignment to a different one?
In each experiment, we used a new set of data as the initial input for the reputation system. In later parts of this paper, we describe the experimental design in detail and analyze the results of the experiments.

2. REPUTATION SYSTEMS
In this paper, we focus on the performance of Hamer's and Lauw's algorithms [2, 3], since they are both iteration-based and comparable to each other.

To compute the reputation of each peer assessor, Hamer's algorithm first assigns the same weight, 1, to all of the assessors [2]. In each iteration, the algorithm calculates a weighted average for each artifact based on the peer assessors' reputations. Then it computes the difference between the aggregated grade of each artifact and each peer assessment grade. The larger this difference is, the more inconsistent the peer assessor is compared with the others. After that, the algorithm updates the reputation of each peer assessor accordingly and calculates the aggregated grade for each artifact again, until the grades converge.

The steps of Lauw's algorithm are similar to Hamer's, but Lauw's algorithm applies different arithmetic formulas to calculate the differences and to scale the final results.
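The Python sketch below mirrors the iterative structure just described (equal initial weights, reputation-weighted averaging, a squared-deviation penalty, iteration until convergence). The update and scaling formulas here are simplified assumptions for illustration, not the published formulas of Hamer's or Lauw's algorithm; the init parameter is what the experiments in this paper vary.

from typing import Dict, Optional

# grades[assessor][artifact] = score the assessor gave to the artifact (0-100 scale).
Grades = Dict[str, Dict[str, float]]


def hamer_style_reputation(grades: Grades,
                           init: Optional[Dict[str, float]] = None,
                           max_iter: int = 100,
                           tol: float = 1e-6):
    """Iterative reputation/aggregation scheme in the spirit of Hamer's algorithm [2].

    Simplified stand-in, not the published formulas: reputations are set inversely
    proportional to each assessor's mean squared deviation from the current
    aggregated grades and rescaled so that the mean reputation is 1. Assessors
    missing from `init` default to a reputation of 1.
    """
    assessors = list(grades)
    artifacts = sorted({art for scores in grades.values() for art in scores})
    # Step 1: every assessor starts at 1 unless a different initial input is given.
    rep = {r: (init or {}).get(r, 1.0) for r in assessors}

    aggregated: Dict[str, float] = {}
    for _ in range(max_iter):
        # Step 2: reputation-weighted average grade for each artifact.
        new_agg = {}
        for art in artifacts:
            pairs = [(rep[r], grades[r][art]) for r in assessors if art in grades[r]]
            total_weight = sum(w for w, _ in pairs)
            new_agg[art] = sum(w * s for w, s in pairs) / total_weight
        # Step 3: assessors who deviate more from the aggregate lose reputation
        # (squared differences, as mentioned in Section 3.2 of the paper).
        new_rep = {}
        for r in assessors:
            sq_devs = [(grades[r][art] - new_agg[art]) ** 2 for art in grades[r]]
            new_rep[r] = 1.0 / (sum(sq_devs) / len(sq_devs) + 1e-9)
        mean_rep = sum(new_rep.values()) / len(new_rep)
        new_rep = {r: v / mean_rep for r, v in new_rep.items()}
        # Step 4: stop once the aggregated grades no longer change.
        converged = aggregated and all(abs(new_agg[a] - aggregated[a]) < tol for a in artifacts)
        aggregated, rep = new_agg, new_rep
        if converged:
            break
    return aggregated, rep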
3. DATA COLLECTION AND EXPERIMENTAL DESIGN
In this section, we provide an overview of the experimental design and validate the dataset we collected for the later experiments.

3.1 Class Setting
We collected our data from two courses (CSC 517, Fall 2015 and Spring 2016, at NC State University) in Expertiza, a web-based educational peer review system [6]. In CSC 517, Fall 2015, 92 students enrolled in the course (98.9% graduate level, 17.4% female). These students come from different countries, with a predominance from India (78.3% India, 9.8% China, 9.8% United States, 2.1% other countries). Moreover, most students major in Computer Science (88.9% Computer Science, 3.7% Computer Engineering, 3.7% Electrical Engineering, 3.7% Computer Networking). In CSC 517, Spring 2016, 54 students enrolled in the course (94.4% graduate level, 13% female). The majority of these students also major in Computer Science (79.3% Computer Science, 9.8% Computer Engineering, 7.6% Electrical Engineering, 3.3% Computer Networking), and they come from different countries (74.1% India, 14.8% China, 9.3% United States, 1.8% other countries).

Each course contains four assignments, all of which are graded on a scale of 0 to 100: a Wikipedia contribution (writing a Wikipedia entry on a given topic), Program 1 (building an information management system with the Ruby on Rails web application framework), the OSS project (typically, refactoring an open-source software model), and the final project (adding new features to an open-source software project).

For each assignment, students have to write at least two peer assessments by completing different kinds of review rubric questions. Furthermore, they can do more peer assessments for extra credit. Several policies are in place to keep students from gaming the system. In summary, each Wikipedia contribution artifact received 9 peer assessments on average, each Program 1 artifact received 15, each OSS project artifact received 13, and each final project artifact was evaluated by 20 assessors on average.

In Expertiza, there are three main types of review rubric questions: choice question, text response, and file upload. The choice question has two subtypes, scored and unscored. Criterion is a scored question; conversely, dropdown, multiple-choice, and checkbox are unscored questions. It is worth noting that in Expertiza, only scored questions are included in peer assessment grades. As a scored question type, criterion is the combination of a dropdown and a text area, which means that peer assessors can not only give a score for a certain question but also write text comments. Criterion is one of the most frequently used question types in Expertiza.

Past research shows that the performance of Hamer's and Lauw's algorithms varies a lot across different kinds of assignments [5]. Hence, we tried to consider different assignment categories in our experimental design. According to the type of submission, we classified five assignments (including one for calibration use, which will be mentioned in Section 4) into three categories: writing assignments, programming assignments, and assignments combining writing with programming. We classified the Wikipedia calibration assignment as a writing assignment because it helped students to improve their peer assessment skills on a writing assignment. The Wikipedia contribution is also classified as a writing assignment; Program 1 is a programming assignment; the OSS project is considered an assignment combining writing with programming. Although the final project contains both a writing section and a programming section, we consider it a writing assignment here, because students peer-assessed only their peers' design documents due to the shortage of time.

3.2 Data Verification
We ran Hamer's and Lauw's algorithms on our dataset and checked whether aggregated grades with reputations as weights were more accurate than naive averages. Table I shows the comparison between aggregated grades and naive averages. Two metrics are used to measure the accuracy of aggregated grades, namely, average absolute bias and root mean square error (RMSE). Average absolute bias indicates the average distance between the aggregated grade and the expert grade; RMSE is another frequently used measurement of the differences between values. Lower average absolute bias and RMSE mean better performance.
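Written out, with a_k the aggregated (or naive-average) grade of artifact k, e_k its expert grade, and n the number of artifacts in an assignment, the two metrics are:

\text{Avg. abs. bias} = \frac{1}{n}\sum_{k=1}^{n}\lvert a_k - e_k\rvert,
\qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\bigl(a_k - e_k\bigr)^{2}}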
Table I shows that aggregated grades calculated by the reputation algorithms perform better than naive averages for six out of the eight assignments. For all six of these assignments, Hamer's algorithm outperforms Lauw's algorithm, so we used only Hamer's algorithm in the later experiments. The two assignments that violate our expectation are the Wikipedia contribution, Fall 2015 and Program 1, Spring 2016.

For the Wikipedia contribution, Fall 2015, the aggregated grades produced by Hamer's algorithm show less validity than the naive averages when compared with the expert grades. One potential reason is that Hamer's algorithm squares the differences between peer assessment grades and aggregated grades during each iteration, which gives the assessors' reputations a larger variance and degrades the performance of the algorithm. For Program 1, Spring 2016, it is very likely that the instructor-defined test cases in this semester were not as elaborate as those in the previous semester (the average absolute bias of Program 1 in Fall 2015 is much lower than that in Spring 2016).

Table I. Comparison of differences among aggregated grades from Hamer's and Lauw's algorithms and naive averages

Assgt. name                       Metric          Hamer's alg.   Lauw's alg.   Naive average
Wikipedia contrib., Fall 2015     Avg. abs. bias  4.62           3.49          3.50
                                  RMSE            6.13           4.71          4.72
Wikipedia contrib., Spring 2016   Avg. abs. bias  2.91           3.15          3.17
                                  RMSE            3.61           3.89          3.94
Prog. 1, Fall 2015                Avg. abs. bias  4.32           5.58          6.21
                                  RMSE            5.84           7.59          8.19
Prog. 1, Spring 2016              Avg. abs. bias  11.46          10.77         10.59
                                  RMSE            13.06          12.46         12.36
OSS project, Fall 2015            Avg. abs. bias  5.30           6.55          7.29
                                  RMSE            6.49           7.46          8.06
OSS project, Spring 2016          Avg. abs. bias  5.22           6.90          7.00
                                  RMSE            6.12           8.47          8.57
Final project, Fall 2015          Avg. abs. bias  4.64           5.93          6.27
                                  RMSE            7.52           8.70          9.03
Final project, Spring 2016        Avg. abs. bias  4.65           5.91          6.07
                                  RMSE            5.91           7.48          7.61

An instructor-defined test case is much like a test case in software engineering: a set of conditions used to check whether an application is working as it was originally designed [7]. The purpose of instructor-defined test cases is to help students understand the requirements of a certain assignment and also to help teaching staff grade the students' artifacts. For instance, "Can an admin delete other admins other than himself and the preconfigured admin?" is an instructor-defined test case. This question was used in both the review rubric and the expert grading stage. By manually testing a series of instructor-defined test cases, teaching staff and peer assessors are able to decide the grades.
Instructor-defined test cases are used heavily in Program 1. Because Program 1 is not a topic-based assignment and all students are required to build web applications with the same functionality, it is easier for the instructor to create such test cases than for assignments with different topics.

Hamer's algorithm is iteration-based, which means that the algorithm takes several iterations before a solution (fixed point) is reached. However, the results generated by these two algorithms can be locally optimal solutions instead of a globally optimal solution [2]; that is, the result of each algorithm can be optimal within a neighboring set of candidate solutions rather than among all possible solutions [8]. One reason is that the initial reputation assigned to each peer assessor is always equal to 1, which forces every peer assessor's ability to be the same at the very beginning.

In order to verify that different initial inputs lead to different fixed points, we assembled a very small set of peer assessment records, shown in Table II. Four peer assessors (a, b, c, d) assessed four artifacts (1, 2, 3, 4). To make the dataset more similar to a real scenario, we assumed that assessor b did not assess artifact 3.

We used two sets of data as initial inputs for Hamer's algorithm, whose reputation range is [0, ∞). The first set of initial input is the same as the default setting of Hamer's algorithm, 1 for all assessors; the second set is arbitrarily chosen, that is, 0.2 for assessor 3 and 1 for the rest. The final reputations are shown in Table III. As the table shows, we obtained two totally different sets of results, so different initial inputs clearly affect the final results.

Table II. Scores assigned by four peer assessors to four artifacts

            Assessor a   Assessor b   Assessor c   Assessor d
Artifact 1  10           9            10           8
Artifact 2  7            6            8            7
Artifact 3  7            -            2            4
Artifact 4  6            7            3            3

Table III. Reputations with initial input equal to 1 and with other values

Assessor   Rep. values with init. rep. all eq. to 1   Rep. values with init. rep. not all eq. to 1
1          0.50                                       2.66
2          0.77                                       2.67
3          2.00                                       0.42
4          2.59                                       0.79

This test shows that there are different fixed points for this dataset. If we use 1 as the initial input, it will often lead to a "reasonable" fixed point, but not always [2]. Instead, if we have prior knowledge about which assessors might be credible, we should make use of this prior knowledge, and the algorithm may converge to a more reasonable fixed point accordingly.

Thus, we tried to use different initial inputs to obtain more accurate aggregated grades. Instead of assigning an arbitrary initial reputation to each student, drawing on other available data, such as reputations from another review round, calibration results [9], or reputations from former assignments, can be a better way to achieve more reasonable results.
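In terms of the sketch from Section 2, the two runs on the Table II data differ only in the init argument. The values this simplified sketch produces will not match Table III, which was computed with Hamer's actual formulas; assessor c is used below as the assessor whose initial reputation is lowered to 0.2, which is an assumption about how Table II's labels map to Table III's rows.

# Peer assessment grades from Table II (assessor b did not assess artifact 3).
table_ii = {
    "a": {"1": 10, "2": 7, "3": 7, "4": 6},
    "b": {"1": 9,  "2": 6,         "4": 7},
    "c": {"1": 10, "2": 8, "3": 2, "4": 3},
    "d": {"1": 8,  "2": 7, "3": 4, "4": 3},
}

# Run 1: the default initial input, reputation 1 for every assessor.
agg_default, rep_default = hamer_style_reputation(table_ii)

# Run 2: one assessor starts at 0.2, the rest at 1 (the second setting in Section 3.2).
agg_alt, rep_alt = hamer_style_reputation(
    table_ii, init={"a": 1.0, "b": 1.0, "c": 0.2, "d": 1.0})

print(rep_default, rep_alt)  # the two runs need not end at the same fixed point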
3.3 Research Questions
This part presents three research questions. By answering them, we can figure out whether replacing the initial input of Hamer's algorithm, from 1 to other data available in the same course, will produce more accurate results.

3.3.1 Can reputation be taken from a formative review round to a summative review round?
Since Fall 2015, Expertiza has allowed different rubrics to be used in each round of review. For each assignment with this feature, students were encouraged to finish two rounds of peer assessment: a formative review round and a summative review round. During the formative review round, the teaching staff presented an elaborate formative rubric to peer assessors. Two questions asked in the formative rubric are presented below. "Rate how logical and clear the organization is. Point out any places where you think that the organization of this article needs to be improved." "List any related terms or concepts for which the writer failed to give adequate citations and links. Rate the helpfulness of the citations." The purpose of these questions is to encourage peer assessors to look into the artifact, point out problems, and offer insightful suggestions [10]. After an assessor submitted formative feedback, Expertiza calculated the assessment grade based on the scored questions in the formative rubric.

After that, authors have a chance to modify their work according to the information given by their peers. In the summative review round, the teaching staff offered a summative rubric designed to guide peer assessors in evaluating the overall quality of the artifacts and checking whether the authors made the changes suggested in the formative review round. Below are two questions used in the summative rubric. "Coverage: does the artifact cover all the important aspects that readers need to know about this topic? Are all the aspects discussed at about the same level of detail?" "Clarity: Are the sentences clear and non-duplicative? Is the language used in this artifact simple and basic enough to be understood?" After assessors submitted their summative feedback, Expertiza calculated the assessment grades again for each artifact that received new feedback.

We hypothesized that the assessment credibility of the same assessor in the formative and summative review rounds is related, and that reputations calculated from the formative review round, if used as the initial input of the summative review round, can produce more accurate aggregated grades.

3.3.2 Can reputation be taken from calibration to real assignments?
At the beginning of the Spring 2016 semester, we created a Wikipedia calibration assignment before the real assignments. The instructor selected several representative artifacts from former semesters; these artifacts had major differences in quality. The instructor then submitted an expert peer assessment for each artifact, based on the same review rubric that students would use. During class, students assessed those artifacts in Expertiza. After that, Expertiza generated a report for both the instructor and the students. Based on the report, the instructor analyzed the results and helped students enhance their peer assessment skills. We hypothesized that the assessment credibility of the same assessor on the calibration assignment and on the real assignment is related, and that reputations calculated from the calibration assignment, if used as the initial input of the subsequent real assignment, can produce more accurate aggregated grades.

3.3.3 Can reputation be taken from one assignment to a different one?
In our dataset, there are four real assignments in a fixed order in each semester. We hypothesized that the assessment credibility of the same assessor on one assignment and on a subsequent one is related, and that reputations calculated from one assignment, if used as the initial input of the subsequent assignment, can produce more accurate aggregated grades. What is more, since we have already classified all assignments into three categories, we also assumed that using reputations from one assignment as the initial input of a subsequent assignment of the same category can produce even more accurate aggregated grades.
4. EXPERIMENTS AND ANALYSIS
For each of the three questions listed in the last section, we did a corresponding experiment to verify the derived hypothesis. Since Hamer's algorithm outperformed Lauw's algorithm for six out of eight assignments in the data verification section, we display only the reputation results from Hamer's algorithm in the experiments.

4.1 Can reputation be taken from a formative review round to a summative review round?
Table IV shows the differences among aggregated grades, naive averages, and expert grades for the assignments in CSC 517, Spring 2016, using the two metrics (average absolute bias and RMSE). We chose CSC 517, Spring 2016 because all assignments in this course support two rounds of peer assessment.

We found that using reputation results from the formative review round as the initial input does not work well for all assignments. Among the four assignments, two (Program 1, Spring 2016 and the final project, Spring 2016) saw improvement with this method. One (the Wikipedia contribution, Spring 2016) converged to the same fixed point as using 1 as the initial input; since Hamer's algorithm is iteration-based, it is possible for different initial inputs to converge to the same fixed point. The last one (the OSS project, Spring 2016) fared even worse with the alternative initial input. Overall, we were not able to conclude whether initial input from the formative review round is a good input option for Hamer's algorithm.

One potential reason is that, according to the first author's master's thesis, peer assessment records from the formative review round generate less accurate aggregated grades than peer assessment records from the summative review round. During the formative review round, peer assessors were encouraged to offer suggestions, and authors might make changes before the summative review round. Hence, it is possible that peer assessments based on the initial version of the products were not accurate. During the summative review round, however, the artifacts were unchangeable and were the same version as the one the teaching staff graded. Therefore, it is reasonable that, compared with expert grades, peer assessments made during the summative review round could have higher validity than those made during the formative review round, and that initial input from the formative review round cannot help to improve the accuracy of the aggregated grades.

Table IV. Comparison of differences between aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from the formative review round

Assgt. name                           Metric          Initial input equal to 1   Initial input from formative review round   Naive average
Wikipedia contribution, Spring 2016   Avg. abs. bias  2.91                       2.91                                        3.17
                                      RMSE            3.61                       3.61                                        3.94
Program 1, Spring 2016                Avg. abs. bias  11.46                      11.35                                       10.59
                                      RMSE            13.06                      12.96                                       12.36
OSS project, Spring 2016              Avg. abs. bias  5.22                       5.29                                        7.00
                                      RMSE            6.12                       6.16                                        8.57
Final project, Spring 2016            Avg. abs. bias  4.65                       4.54                                        6.07
                                      RMSE            5.91                       5.77                                        7.61

4.2 Can reputation be taken from calibration to real assignments?
In this experiment, we further tested whether the aggregated grades can be improved by using calibration results as the initial input. Since we trialed a calibration assignment only for the Wikipedia contribution in the Spring 2016 semester, we based the experiment for this hypothesis on data from the Wikipedia calibration, Spring 2016 and the Wikipedia contribution, Spring 2016.

There was only one round of peer assessment in the Wikipedia calibration, Spring 2016, and we used the same formative review rubric as in the Wikipedia contribution, Spring 2016. After the assessors submitted their feedback, Expertiza computed the assessment grades for the representative artifacts. Then the instructor submitted expert peer assessments based on the same formative review rubric. After that, we calculated each assessor's reputation value based on their assessment grades and the expert grades. When the Wikipedia contribution, Spring 2016 finished, we used the reputation values produced from the calibration assignment as the initial input to compute a new set of reputation values, and compared this new set with the reputation values calculated from an initial reputation equal to 1.
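The paper does not spell out how agreement with the expert assessment is turned into a reputation value, so the following is only an assumed illustration: the closer an assessor's calibration grades are to the expert grades, the higher the initial reputation fed into Hamer's algorithm for the real assignment.

def calibration_reputation(calibration_grades: Grades,
                           expert_grades: Dict[str, float]) -> Dict[str, float]:
    """Assumed illustration, not the paper's formula: an assessor's initial reputation
    is inversely related to the mean squared deviation of their calibration assessments
    from the instructor's expert grades, rescaled so that the mean reputation is 1."""
    raw = {}
    for assessor, scores in calibration_grades.items():
        sq_devs = [(scores[art] - expert_grades[art]) ** 2 for art in scores]
        raw[assessor] = 1.0 / (sum(sq_devs) / len(sq_devs) + 1e-9)
    mean_raw = sum(raw.values()) / len(raw)
    return {assessor: value / mean_raw for assessor, value in raw.items()}

# The resulting values then serve as the initial input for the real assignment, e.g.:
# aggregated, reputations = hamer_style_reputation(wiki_contribution_grades,
#                                                  init=calibration_reputation(calib, expert))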
Table V shows that both average absolute bias and RMSE decrease when Hamer's algorithm is run with the calibration results as the initial input. However, the data used in this experiment are quite limited. If we want to further verify the efficacy of the calibration process, more data and more experiments are needed.

Table V. Comparison of differences between naive averages and aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from the calibration assignment

Wikipedia contribution, Spring 2016
Different sets of aggregated grades                  Avg. abs. bias   RMSE
Hamer's alg. with initial input equal to 1           2.91             3.61
Hamer's alg. with initial input from calibration     2.80             3.51
Naive averages                                       3.17             3.94

4.3 Can reputation be taken from one assignment to a different one?
In this experiment, we tested the hypothesis that using reputations from former assignments as the initial input will produce more accurate aggregated grades. Both CSC 517, Fall 2015 and CSC 517, Spring 2016 include the Wikipedia contribution, Program 1, the OSS project, and the final project, and these four assignments occur in a fixed order. To verify this hypothesis, we designed three sub-experiments, run separately on the two courses. The first sub-experiment is based on the Wikipedia contribution and Program 1: we used initial input from the Wikipedia contribution assignment and peer assessment records from Program 1 to compute the aggregated grades, and compared these results with aggregated grades based on an initial input equal to 1. The second sub-experiment was between Program 1 and the OSS project: following the same process as in the first sub-experiment, we produced aggregated grades with initial input from Program 1 and peer assessment records from the OSS project, and compared them with the grades calculated with the initial reputation equal to 1. The third sub-experiment was between the OSS project and the final project. The results are shown in Table VI.

Table VI shows that in Fall 2015, aggregated grades with initial input from former assignments have higher validity than the grades produced with an initial input equal to 1. For Spring 2016, among the three sub-experiments, one (between the Wikipedia contribution, Spring 2016 and Program 1, Spring 2016) converged to the same fixed point as using 1 as the initial input, and another (between Program 1, Spring 2016 and the OSS project, Spring 2016) became worse when initial input from the former assignment was used. The last sub-experiment (between the OSS project, Spring 2016 and the final project, Spring 2016) supported our hypothesis. In general, initial input from former assignments obtained equal or better results than an initial input equal to 1 in five out of six experiments. Therefore, we believe that initial input from former assignments can increase the accuracy of aggregated grades.

Table VI. Comparison of differences between aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from former assignments

Transfer / assignment           Metric          Initial input from former assgt.   Initial input equal to 1   Naive average
Wiki → Prog 1, Fall 2015        Avg. abs. bias  4.13                               4.32                       6.21
                                RMSE            5.76                               5.84                       8.19
Wiki → Prog 1, Spring 2016      Avg. abs. bias  11.46                              11.46                      10.59
                                RMSE            13.06                              13.06                      12.36
Prog 1 → OSS, Fall 2015         Avg. abs. bias  5.08                               5.30                       7.29
                                RMSE            6.31                               6.49                       8.06
Prog 1 → OSS, Spring 2016       Avg. abs. bias  5.41                               5.22                       7.00
                                RMSE            6.36                               6.12                       8.57
OSS → Final, Fall 2015          Avg. abs. bias  4.52                               4.64                       6.27
                                RMSE            7.46                               7.52                       9.03
OSS → Final, Spring 2016        Avg. abs. bias  4.55                               4.65                       6.07
                                RMSE            5.81                               5.91                       7.61
Although the new method introduced above (initial input from former assignments) performs better in most cases, the improvement is limited; most of the time, it improved the average absolute bias by less than 0.5 points. What is more, one sub-experiment (between Program 1, Spring 2016 and the OSS project, Spring 2016) obtained even worse results with this new method. One potential explanation is that Program 1 is a programming assignment, but the OSS project combines a writing section with a programming section. During the grading of the OSS project, the teaching staff gave scores to the writing section and the programming section separately. These two scores are related, but not always directly proportional to each other. If one team did a good job on programming but wrote the writing section perfunctorily, they would get a low score on the writing section regardless of their high score on programming. However, if another team did not do the programming section well, they would usually not receive a high score on the writing section either. Both the writing and programming scores are on a scale of 0 to 100, and the final score of the OSS project is the combination of these two scores with corresponding weights defined by the teaching staff.

In order to verify the effect of assignment categories and to obtain more improvement, we designed a new experiment that used initial input from both the Wikipedia contribution and Program 1, acting on the peer assessment records from the OSS project. That is to say, the initial input for the OSS project writing section came from the Wikipedia contribution and the initial input for the OSS project programming section came from Program 1. Furthermore, we combined the aggregated grades of the writing section and the programming section with the same weights used for producing the final expert grades of the OSS project (see the sketch after Table VII). Table VII presents the experiment results for both Fall 2015 and Spring 2016. Compared with the results produced from an initial input equal to 1, the average absolute bias decreases by more than 1.3 points on average with this new method. This is a big improvement, which indicates that assignment categories should be taken into consideration in future work.

Table VII. Comparison of differences among aggregated grades from Hamer's algorithm with initial input equal to 1 and with initial input from the Wikipedia contribution and Program 1

OSS project, Fall 2015
Different sets of aggregated grades                                            Avg. abs. bias   RMSE
Hamer's alg. with initial input equal to 1                                     5.30             6.49
Hamer's alg. with initial input of the writing section from the Wikipedia
  contribution and of the programming section from Program 1                   3.32             4.44
Naive averages                                                                 7.29             8.01

OSS project, Spring 2016
Different sets of aggregated grades                                            Avg. abs. bias   RMSE
Hamer's alg. with initial input equal to 1                                     5.22             6.12
Hamer's alg. with initial input of the writing section from the Wikipedia
  contribution and of the programming section from Program 1                   4.45             5.12
Naive averages                                                                 7.00             8.57
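A minimal sketch of the per-section setup described above, reusing hamer_style_reputation and the Grades alias from Section 2. The 0.5/0.5 weights are placeholders; the actual weights were defined by the teaching staff and are not given in the paper.

def combine_oss_sections(writing_grades: Grades,
                         programming_grades: Grades,
                         rep_from_wiki: Dict[str, float],
                         rep_from_prog1: Dict[str, float],
                         w_writing: float = 0.5,
                         w_programming: float = 0.5) -> Dict[str, float]:
    """Aggregate each OSS-project section with the reputation carried over from the
    earlier assignment of the same category, then recombine the two section grades
    with the same weights used for the expert grades (0.5/0.5 here is a placeholder)."""
    agg_writing, _ = hamer_style_reputation(writing_grades, init=rep_from_wiki)
    agg_programming, _ = hamer_style_reputation(programming_grades, init=rep_from_prog1)
    return {team: w_writing * agg_writing[team] + w_programming * agg_programming[team]
            for team in agg_writing}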
5. DISCUSSION
After the three experiments, we found that the average absolute biases of some assignments are still high even with our new methods, which means that there still exist obvious differences between expert grades and aggregated grades. There must be other issues that also affect the aggregated grades and are not considered by these algorithms, such as mediocrely designed rubrics, insufficient peer-review training, etc.

We then determined that the mediocre design of review rubrics and the teaching staff's idiosyncratic grading style may help to explain these high biases. For instance, the OSS project, Fall 2015 is an assignment with both a formative review round and a summative review round. Its summative rubric has seven questions. Each question in this rubric has the same weight, and Expertiza uses the naive average as the final grade, which means that each question affects more than 14% of the final grade.

One OSS artifact got 91 for the expert grade but only about 75 for the aggregated grade. The final comment given by the teaching staff was:

"Well, from the video they did the thing we expect them to do, but their tests are failing, and they should have fixed them."

In the summative rubric, there is a test-related question:

"IF it is an Expertiza project, check the pull request. Did the build pass in Travis CI? Was there any conflict that must be resolved? You can check those on the pull request on GitHub. Ignore this question if it is not an Expertiza project."

According to the 13 valid peer-review records, most peer reviewers were able to spot this problem. The average score for this question is 2.16 out of 5, which means that on average more than 8 points were deducted from the total score because the code did not pass Travis CI.

Moreover, during grading, the teaching staff almost did not consider another question in this rubric:

"Check the commits. Was new code committed during the 2nd round?"

Since this team did not commit new code, or did not commit promptly, the average score for this question is 3.58 out of 5, which means that on average more than 4 points were taken off the total score. These two questions alone deducted more than 12 points from the total score.
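As a rough arithmetic check of these figures, each of the seven equally weighted questions is worth 100/7 ≈ 14.3 points of the final grade, so

\Bigl(1 - \tfrac{2.16}{5}\Bigr) \times \tfrac{100}{7} \approx 8.1 \text{ points},
\qquad
\Bigl(1 - \tfrac{3.58}{5}\Bigr) \times \tfrac{100}{7} \approx 4.1 \text{ points},

for a combined deduction of roughly 12.2 points, consistent with the "more than 12 points" above.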
What is more, only 3 out of the 31 artifacts got grades lower than 90, and this one got 91. Evidently the teaching staff also considered it not a very successful artifact; nevertheless, a relatively tolerant grade was still assigned to this team. This can be the reason why there are large differences between expert grades and aggregated grades. A new grading method or a newly designed rubric may help to solve this problem.

6. CONCLUSIONS
In this paper, we propose several novel methods to improve the accuracy of aggregated grades generated by reputation algorithms. Since Hamer's and Lauw's algorithms are iteration-based, we tried different sets of initial inputs in order to get aggregated grades with the least bias.

We designed three experiments. The first used reputations from the formative review round as the initial input for the summative-review-round peer assessment records. Compared with an initial input equal to 1, this method did not help us obtain aggregated grades with higher accuracy, because after the formative peer assessment stage, authors have a chance to modify their work, which makes the peer assessments from the formative review round less accurate.

The second experiment used the reputations from the calibration assignment as the initial input. The results show that this method can help us get more accurate aggregated grades. However, many questions need to be answered to further verify the efficacy of the calibration process. For example, when should we have assessors perform the calibration process, at the beginning of the course or before each real peer assessment stage? How many calibration processes do we need, just one, or one for each assignment category? What content should be included in calibration? By answering these questions, we can gain a deeper understanding of the calibration process and help improve the quality of peer assessment.

The last experiment focused on initial input taken from former assignments. The results supported our hypothesis that aggregated grades calculated in this way can outperform naive averages. We also verified that, under certain circumstances, taking assignment categories into account improves the accuracy of aggregated grades considerably.

Our new methods can help to improve performance, but the average absolute biases of some assignments are still high. After looking into this, we find that there is still room to improve our review rubrics to resolve ambiguity and to provide more guidance to students (e.g., training or calibration). What is more, both Hamer's and Lauw's algorithms are rating-based. Some other educational peer review systems, such as Critviz² and Mobius SLIP³, measure the quality of peer assessments based on ranking. A different set of results might be found if we used ranking-based algorithms. We hope to address these issues, use different kinds of algorithms, and obtain aggregated grades with even higher accuracy in the future.

² https://critviz.com/
³ http://www.mobiusslip.com/

7. ACKNOWLEDGMENTS
The Peerlogic project is funded by the National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, 32614, and 1431975.
8. REFERENCES
[1] E. F. Gehringer, "A Survey of Methods for Improving Review Quality," in New Horizons in Web Based Learning, Y. Cao, T. Väljataga, J. K. T. Tang, H. Leung, and M. Laanpere, Eds. Springer International Publishing, 2014, pp. 92–97.
[2] J. Hamer, K. T. K. Ma, and H. H. F. Kwong, "A Method of Automatic Grade Calibration in Peer Assessment," in Conferences in Research and Practice in Information Technology, Australian Computer Society, 2005, pp. 67–72.
[3] H. Lauw, E. Lim, and K. Wang, "Summarizing Review Scores of 'Unequal' Reviewers," in Proceedings of the 2007 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2007, pp. 539–544.
[4] K. Cho, C. D. Schunn, and R. W. Wilson, "Validity and Reliability of Scaffolded Peer Assessment of Writing From Instructor and Student Perspectives," J. Educ. Psychol., vol. 98, no. 4, pp. 891–901, 2006.
[5] Y. Song, Z. Hu, and E. F. Gehringer, "Pluggable reputation systems for peer review: A web-service approach," in IEEE Frontiers in Education Conference (FIE), 2015, pp. 1–5.
[6] E. F. Gehringer, L. M. Ehresman, S. G. Conger, and P. A. Wagle, "Work in Progress: Reusable Learning Objects Through Peer Review: The Expertiza Approach," in Proceedings, Frontiers in Education, 36th Annual Conference, 2006, pp. 1–2.
[7] Wikipedia. (2016). Test case. [Online]. Available: https://en.wikipedia.org/wiki/Test_case
[8] Wikipedia. (2016). Local optimum. [Online]. Available: https://en.wikipedia.org/wiki/Local_optimum
[9] Y. Song, E. F. Gehringer, J. Morris, J. Kid, and S. Ringleb, "Toward Better Training in Peer Assessment: Does Calibration Help?," presented at the CSPRED workshop, EDM 2016, 2016.
[10] S. Yang, Z. Hu, Y. Guo, and E. F. Gehringer, "An Experiment with Separate Formative and Summative Rubrics in Educational Peer Assessment," in IEEE Frontiers in Education Conference (FIE), 2016.