Prediction of Grades for Reviewing with Automated Peer-Review and Reputation Metrics

Da Young Lee, Ferry Pramudianto, Edward F. Gehringer
North Carolina State University, Raleigh, NC 27695
[dlee10, fferry, efg]@ncsu.edu

ABSTRACT
Peer review is an effective and useful method for improving students' learning through review by student peers, and it has been used in classes for several decades. To ensure the success of peer review, research challenges such as the quality of peer reviews must be addressed; in particular, it is challenging to identify how good a reviewer is. We develop a prediction model to assess students' reviewing capability. We investigate several important factors that influence students' reviewing capability, which corresponds to the instructor-assigned grade for reviewing. We use machine learning algorithms to build models for grade prediction for reviewing. Our models are based on several metrics, such as the degree to which a reviewer assigns different scores to different rubric items, and automated metrics that assess the textual feedback given by the reviewers. To improve the models, we also use students' reputation scores as reviewers. We present experimental results that show the effectiveness of the models.

Keywords
Peer reviews, rubrics, prediction model

1. INTRODUCTION
Peer review [1, 4] is an effective and useful method for improving students' learning by reviewing peer students' work. Peer review has been used in classes for several decades. In recent years, peer review has been used not only in traditional classes but also in online courses such as Massive Open Online Courses (MOOCs) [4]. For example, Coursera [2] offers several online courses in which thousands of students from around the world are enrolled. In such cases, instructors are not able to give feedback to such a large number of students in a timely manner. Therefore, the development of peer-review methods based on observing peer behaviors is important, and the technology should be improved to be more reliable and useful to users.

The classroom peer-review process is as follows. Students submit their assignments, and reviewers (peer students) provide reviews of the assignments. The students then have a chance to improve their submitted work by incorporating the scores and comments in the reviews. Because reviewers in education are peer students, they may lack sufficient reviewing experience. Therefore, they need to be guided through the peer-review process to ensure the provision of high-quality reviews.

The assessment of reviews is a challenging problem in education. Meta-reviewing is a manual process [5] in which an instructor assigns grades and provides feedback as a measure of the students' reviewing capability. The problem is that this manual meta-reviewing process is tedious and time-consuming.

To address this issue, this study investigates methods to help identify good reviewers who write high-quality reviews. To attain this goal, we examine factors that may influence review scores and propose a model, built with machine learning algorithms, that predicts how good the reviewers are based on the reviews they wrote.

We investigate several important factors that influence instructor-assigned grades, especially reviewers' score-assigning behaviors. In this paper, we refer to the instructor-assigned grade (i.e., grade) as the students' reviewing-capability score assigned by the instructor. Another factor is automated peer-review metrics, which are text metrics [5, 6] such as tone, used to assess the textual feedback given by the reviewers. The other factor is a reputation metric [11] that determines who is a good reviewer based on historical review scores across artifacts. This reputation metric is calculated based on a measure of the reviewer's leniency ("bias").

In this paper, we first investigate the strength of the correlation between reviewers with high reviewing capability and the spread between their scores. Note that the spread between scores corresponds to the deviation of a reviewer assigning different scores, described in Section 3.3. We then investigate whether a model based on the reviewer assigning different scores would be effective for predicting how good the reviewer is. For this task, we apply machine learning techniques such as decision tree [12] and k-Nearest Neighbors [13] algorithms to build a prediction model. We then investigate whether this model, when incorporating textual feedback, shows positive results for predicting how good the reviewer is.
Lastly, we investigate whether our model combined with text metrics and reputation scores shows positive results for predicting how good the reviewer is. For these tasks, we investigate the following research questions:

- RQ1: Is there a correlation between a reviewer assigning different scores (i.e., the "spread" between scores) to different rubric items and the instructor-assigned grade?
- RQ2: How well does our model, based on the reviewer assigning different scores, predict instructor-assigned grades?
- RQ3: How well does our model combined with text metrics of reviews predict instructor-assigned grades?
- RQ4: How well does our model combined with text metrics and reputation scores of reviewers predict instructor-assigned grades?

The rest of the paper is organized as follows. In Section 2, we briefly introduce the peer-review process and the peer-review system Expertiza [3]. In Section 3, we describe our methodology. In Section 4, we present our experimental results. Finally, we give concluding remarks in Section 5.

2. BACKGROUND
This section discusses the background for this study.

2.1 Peer Review System: Expertiza
There are many tools that support the peer-review process [3, 7, 8]. Expertiza is a web-based education system with an integrated peer-review feature, which is part of an active learning process for peer students.

Using Expertiza, students in a class select tasks from an assignment list. After students complete their tasks, they submit their work to receive reviews from peers in the peer-review system. The submissions are reviewed by anonymous peers, who can provide helpful comments and give scores based on rubrics. Researchers have worked on peer-review systems for decades and have improved Expertiza as both a learning management system and a peer-review system.

Students expect to receive author feedback. Typically, a double-blind review process makes it difficult for students to explain what they have done, especially when reviewers misunderstand the contents of the submissions and give low grades.

In Expertiza, peer review may have multiple rounds, in which the reviewers give feedback for improvement and check whether the suggestions have been implemented in the next round. Each round has several deadlines, which are useful for organizing reviewing and resubmission.

In Expertiza, wiki support is integrated for collaboration among students. For their submissions, students may use a wiki, which is very helpful for collaborative work on writing assignments. These wikis provide several features for easy editing and for keeping track of past revisions.

2.2 Peer Review
Each student can select more than one submission to review within one assignment period. Each review consists of a review rubric to guide students in the completion of the review. Each rubric may include multiple questions, called criteria. Appendix A is an example of a rubric, which consists of 12 rubric criteria. For example, each question may ask for an assessment of the organization, originality, grammar, or clarity of a writing submission under review. The rubric also asks whether the submission contains definitions, examples, and links of acceptable quality.

In the peer-review process, reviewers often provide two kinds of feedback: quantitative (scores) and qualitative (textual) feedback. Reviewers assign numeric scores for certain rubric criteria; in other words, after the reviewers read the rubric, they submit textual feedback and a numeric score for each criterion. For example, a rubric criterion can be, "On a scale of 1 (worst) to 5 (best), how easy is it to understand the code?" Moreover, reviewers are often required to provide formative textual feedback in which their comments describe identified issues, suggestions, and other remarks. Numeric scores are helpful, but textual feedback gives more concrete ideas about the submission.
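To make the review structure described above concrete, the following minimal sketch shows one way a single peer review — per-criterion numeric scores plus textual comments — could be represented in code. The class and field names are illustrative assumptions, not the Expertiza schema.

```python
# Illustrative sketch of a peer-review record; the names used here are
# hypothetical and do not reflect the Expertiza database schema.
from dataclasses import dataclass
from typing import List

@dataclass
class CriterionResponse:
    criterion_id: str      # e.g., "A1. Organization"
    score: int             # numeric score on the rubric scale (e.g., 0-5)
    comment: str           # formative textual feedback

@dataclass
class PeerReview:
    reviewer_id: str
    submission_id: str
    round_no: int                      # Expertiza reviews may have multiple rounds
    responses: List[CriterionResponse]

    def scores(self) -> List[int]:
        return [r.score for r in self.responses]

# Example review with three criterion responses (scores 5, 4, 2).
example = PeerReview(
    reviewer_id="alice", submission_id="bob-wiki", round_no=2,
    responses=[
        CriterionResponse("A1. Organization", 5, "The organization is good..."),
        CriterionResponse("A2. Clarity", 4, "They are clear..."),
        CriterionResponse("A3. Revision", 2, "No, I don't see any changes..."),
    ],
)
print(example.scores())   # [5, 4, 2]
```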
3. METHODOLOGY
This section discusses the methodology for this study.

3.1 Data
3.1.1 Data Collection
We assemble peer-review data from Expertiza [3], a web-based educational application that helps students review peers' work. We analyze 703 records submitted by students who were assigned to grade the assignments of their peers.

The data set is collected from two graduate-level courses, CSC 517 (Object-Oriented Design and Development) and CSC 506 (Architecture of Parallel Computers), both offered at North Carolina State University. In CSC 517, for example, programming assignments and writing assignments are used for peer review. These are team-based assignments on which two or more students collaborate. We use six review assignments, four of which are related to writing and results, and two of which are related to programming assignments.

In this study, instructors manually assess the submitted reviews and assign scores within one review period, during which each student may review multiple submissions. A final grade is given based on the student's own submission and the quality of the reviews they wrote when assessing their peers' submissions.

3.1.2 Data Preparation
A data-cleaning process is required before analysis; it includes combining multiple database and Excel tables based on the user's id using SAS. During this process, we remove entries whose numeric scores are 0 or NULL, which indicate empty reviews. Invalid numeric scores can occur when students drop their courses and never assign scores to the submissions of their peers. In addition, a rubric may require only textual feedback; such rubrics are not included in this study.
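As a rough illustration of the cleaning step described in Section 3.1.2, the sketch below joins toy stand-ins for the score and grade tables and drops 0/NULL scores. The study performed this step in SAS, so the pandas code, table contents, and column names here are assumptions.

```python
import pandas as pd

# Toy stand-ins for the exported tables; in the study these came from the
# Expertiza database and Excel sheets and were combined with SAS.
reviews = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "score":   [5, 4, 0, None, 3],     # 0 / NULL indicate empty reviews
    "comment": ["good", "clear", "", None, "ok"],
})
grades = pd.DataFrame({"user_id": [1, 2, 3], "review_grade": [95, 80, 88]})

# Combine the tables on the student's id.
data = reviews.merge(grades, on="user_id", how="inner")

# Remove entries whose numeric score is 0 or NULL, which indicate empty
# reviews (e.g., from students who dropped the course).
data = data[data["score"].notna() & (data["score"] != 0)]
print(data)
```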
3.2 Research Questions
We investigate several important factors that influence instructor-assigned grades, especially reviewers' score-assigning behaviors. As explained in Section 1, we refer to the instructor-assigned grade (i.e., grade) as the students' reviewing-capability score assigned by the instructor.

To study the usefulness of review quality assessment, we investigate the following research questions:

- RQ1: Is there a correlation between a reviewer assigning different scores (i.e., the "spread" between scores) to different rubric items and the instructor-assigned grade?
- RQ2: How well does our model, based on the reviewer assigning different scores, predict instructor-assigned grades?
- RQ3: How well does our model combined with text metrics of reviews predict instructor-assigned grades?
- RQ4: How well does our model combined with text metrics and reputation scores of reviewers predict instructor-assigned grades?

We now describe the research questions in more detail. For RQ1, reviewers may assign grades for multiple submissions within the same review assignment. This research question investigates the strength of the correlation between reviewers with high reviewing capability and the spread between their scores; the spread between scores is measured by the weighted standard deviation described in Section 3.3. RQ2 investigates whether a model based on the reviewer assigning different scores would be effective for predicting how good the reviewer is. RQ3 investigates whether this model, when incorporating textual feedback, shows positive results for predicting how good the reviewer is. RQ4 investigates whether our model combined with text metrics and reputation scores shows positive results for predicting how good the reviewer is. Note that we use a text analysis tool to automatically extract the text metrics [5, 6]. We measure text metrics of the given textual feedback such as content type, tone, and volume.

3.3 Metrics
We utilize the following metrics to address the research questions.

- Pearson Correlation Coefficient: The Pearson correlation coefficient measures the simple linear correlation between two sets of data; it shows the degree to which they are related. The correlation is measured as the covariance of the two variables divided by the product of their standard deviations. We measure the correlation between a reviewer assigning different scores to different rubric items and that reviewer being given a high grade by the instructor.
- Weighted Standard Deviation: This metric is measured as ŵ · √( (1/n) Σ (x_i − M)² ), i = 1, ..., n, where the standard deviation measures the spread of the observed numbers (x_1, x_2, ..., x_n) in a data set with mean value M, and ŵ is the weight. We measure this value as the degree of spread of the scores given by a reviewer; ŵ is the number of reviews assigned to each reviewer within one assignment (a short computational sketch is given below).
- Average Number of Words (Avg. # Words): Given more than one review comment, this metric is the average number of words per comment.

We measure the weighted standard deviation and the average number of words, which are used as inputs to the machine learning algorithms for predicting instructor-assigned grades.

- Root Mean Square Error (RMSE): The RMSE between predicted values and actual values is computed as the square root of the mean of the squared deviations.
- Score Difference (Score Diff): This metric is the gap between predicted values and actual values.

RMSE and Score Diff are used to measure the effectiveness of the models: the larger the RMSE and Score Diff, the less effective the prediction model; the smaller they are, the more effective the model.
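The weighted standard deviation defined above can be computed in a few lines. The sketch below also reproduces the numbers behind the worked example in Section 3.4 (scores 5, 4, 2 with weight 3), which appears to use the sample (n−1) standard deviation, so both variants are shown.

```python
# Minimal sketch of the weighted standard deviation metric:
# w_hat * sqrt((1/n) * sum_i (x_i - M)^2), with w_hat the number of reviews.
import statistics

def weighted_std(scores, weight, sample=True):
    """Spread of a reviewer's scores, scaled by the number of reviews (weight)."""
    spread = statistics.stdev(scores) if sample else statistics.pstdev(scores)
    return weight * spread

scores = [5, 4, 2]
print(round(weighted_std(scores, weight=3), 2))                 # ~4.58 (stdev ~1.53)
print(round(weighted_std(scores, weight=3, sample=False), 2))   # ~3.74 (pstdev ~1.25)
```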
The correlation coefficient ranges between −1 and 1, where 1 implies a perfect linear relation between X and Y, −1 implies that Y decreases linearly as X increases, and 0 implies no linear relation.

For determining the quality of the textual feedback, we use various text metrics, which can be collected automatically via a text analysis tool [5, 6]:

- Content Type: Identification of the types of content in reviews. Reviews may include different types of content; this metric classifies content into one of three categories: summative, problem-detection, or advisory content.
  - Summative content: positive feedback or a summary of the submission. For example, "The page is organized logically" is classified as summative content.
  - Problem-detection content: identifies problems in the submission. For example, "The page lacks a qualitative approach and an overview" is classified as problem-detection content.
  - Advisory content: provides suggestions to the students for improving their work. For example, "The page could contain more ethics-related links" is classified as advisory content.
- Tone: Reviews may have different tones, which refer to the semantic orientation of the text, given the words and presentation written by the reviewers. This metric classifies content into one of three tones: positive, negative, or neutral.
  - Positive: A review is classified as having a positive tone when it contains positive feedback overall. For example, positive words or phrases such as "well-organized paper" and "complete" indicate a positive semantic orientation.
  - Negative: A review is classified as having a negative tone when it contains negative feedback overall. For example, negative words or phrases such as "copied", "poor", and "not complete" indicate a negative semantic orientation.
  - Neutral: A review is classified as having a neutral tone when it contains neutral feedback or a mix of positive and negative feedback. For example, a mix of positive and negative words or phrases such as "The organization looks good overall; however, we did not understand the terms" indicates a neutral semantic orientation: "looks good" can be positive and "did not understand" can be negative.
- Volume: Reviews may include different words. This metric refers to the quantity of unique tokens in the review, excluding stop words such as pronouns.

We measure these text metrics and use them as additional inputs to the machine learning algorithms for predicting instructor-assigned grades (an illustrative sketch follows at the end of this section).

We also use Lauw's reputation score, as identified by Song et al. [11]. The Lauw-peer algorithm is based on a measure of the reviewer's leniency ("bias"), which can be either positive or negative.

- Lauw's Reputation Score: This metric measures who is a good reviewer based on historical review scores across artifacts. The reputation range calculated by the Lauw algorithm is [0, 1]; a reputation score close to 1 means the reviewer is credible.
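For illustration only, the sketch below mimics the volume and tone metrics with simple token counting and keyword lists. The study extracts these metrics with an automated text-analysis tool [5, 6], so the stop-word and sentiment lists here are stand-in assumptions rather than that tool's method.

```python
# Stand-in word lists; the real text-analysis tool does not work this way.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "it", "this", "we", "i"}
POSITIVE = {"good", "clear", "well-organized", "complete", "logically"}
NEGATIVE = {"poor", "copied", "lacks", "missing", "not"}

def _tokens(comment: str) -> set:
    return {t.strip(".,;!?\"'").lower() for t in comment.split()} - {""}

def volume(comment: str) -> int:
    """Volume: count of unique tokens, excluding stop words."""
    return len(_tokens(comment) - STOP_WORDS)

def tone(comment: str) -> str:
    """Tone: positive, negative, or neutral semantic orientation."""
    tokens = _tokens(comment)
    pos, neg = len(tokens & POSITIVE), len(tokens & NEGATIVE)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return "neutral"     # neutral feedback or a mix of positive and negative

comment = "The organization looks good overall; however, we did not understand the terms."
print(volume(comment), tone(comment))   # 9 neutral
```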
3.4 Approach
Machine learning approaches [12, 13] such as k-nearest neighbors, decision trees, and neural networks are useful for prediction. For our experiments, we use k-nearest neighbors and decision trees, which are based on supervised learning [13]. We first choose k-nearest neighbor classifiers because they are based on learning by analogy over the input values; with this model, we can observe closeness patterns based on the Euclidean distance between the training and validation data sets. Because this model is rooted in analogy, we also use another approach that is known for high performance when classifying categorical values: a decision tree model is a good fit for our input data set with its different kinds of values, and it aids comprehensibility by showing a tree structure. In this paper, we did not consider other machine learning algorithms such as SVM and Naïve Bayes, because they require complicated calculations and it is not easy to track how values are classified.

We first propose a simple baseline model, which predicts instructor-assigned grades by the average grade for reviewing in the training data set. This baseline is used to judge whether machine learning algorithms would be useful for grade prediction; we then develop models based on machine learning algorithms, which can show better prediction than the simple baseline.

As Expertiza has been used for several years, we have sufficient data concerning students' reviews and instructor-assigned grades. We divide our data into a training set and a validation set. Our goal in using the machine learning approach is to predict instructor-assigned grades. This problem is related to the reputation of peers: peers who carefully review other students' submissions are likely to receive higher scores. We first use the training data set to train our decision tree model and then apply the model to the validation set for score prediction.

In this paper, we use a decision tree [12] to explore and model our data. Decision trees are typically used in operations research, especially for decision analysis; since we would like to predict specific decisions related to scores, decision models are applicable. We use an SAS tool called JMP [12], in which machine learning approaches are integrated. In this tool, the partition function recursively partitions data according to the relationship between the data sets: given the relationship between the X and Y values, a tree of partitions is created. In this process, the tool automatically searches for groups and continues splitting off separate groups. This process is conducted recursively until the tool reaches a specified desired fit; the tool stops when the prediction result no longer improves.

We also consider the k-nearest neighbors algorithm (k-NN for short). The k-NN algorithm is widely used among machine learning algorithms. k-NN is a non-parametric method used for classification and is supported by the SAS/JMP tool. The input consists of the k closest training examples, which contain similar features; k-NN is a type of learning that finds an approximated, locally similar classification.

Table 1. Examples of writing-assignment reviews for a submission, showing the numeric score and textual feedback given by reviewers. The detailed questions related to the rubric criteria are found in Appendix A.

    Rubric Criteria    Score   Textual Feedback
    A1. Organization   5       The organization is good and nicely gives intro, features and then examples of the framework
    A2. Clarity        4       They are clear and the language is easily understood
    A3. Revision       2       No, I don't see any changes from previous version

Table 2. Input data sets used with our decision tree and k-nearest neighbors algorithms for predicting instructor-assigned grades for reviewing.

    Name   Input Data Set
    Base   Weighted Standard Deviation; Average Number of Words
    Text   Content Type; Tone; Volume
    Rep    Lauw's Reputation score
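The modelling setup of this section can be sketched as follows, under the assumption that an analogous open-source stack is acceptable: the study used the SAS/JMP partition (decision tree) and k-NN tools, so the scikit-learn calls and the synthetic data below are a reconstruction, not the original implementation. X holds the Base inputs (weighted standard deviation, average number of words) and y holds the instructor-assigned grades.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform([0, 5], [6, 40], size=(90, 2))       # toy stand-in for real records
y = 70 + 3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 5, 90)

# 2/3 of the data for training, 1/3 for validation, as in the study.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=1/3, random_state=0)

models = {
    "baseline": None,                                # predicts the mean training grade
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "k-NN (k=10)": KNeighborsRegressor(n_neighbors=10),
}
for name, model in models.items():
    if model is None:
        pred = np.full(len(y_va), y_tr.mean())
    else:
        pred = model.fit(X_tr, y_tr).predict(X_va)
    rmse = mean_squared_error(y_va, pred) ** 0.5
    print(f"{name}: RMSE={rmse:.1f}, avg |diff|={np.abs(y_va - pred).mean():.1f}")
```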
We describe the steps of our work using the example reviews in Table 1. We compare the automated metrics and the reputation score assessed by the reputation algorithm, Lauw's model, for predicting instructor-assigned grades for reviewing, and we also combine the two approaches to see whether the combined models show better prediction. Consider that Alice gives scores and comments to Bob; the rubric criteria A1, A2, and A3 are found in Appendix A.

Step 1. Collect the list of scores that students gave to assignments. Consider that Alice gave scores and comments on Bob's assignment. As shown in Table 1, Alice rated Bob's assignment 5, 4, and 2. These scores are used for calculating the weighted standard deviation, which is 4.5, given that the standard deviation is 1.5 and the weight is 3.

Step 2. Collect the list of comments given by a reviewer. The number of words in each textual feedback item is 15, 9, and 9; the total count of words is 33, so the average number of words is 11.

Step 3. Consider that Alice's reviewing score, assigned by the instructor, is 95. Using the weighted standard deviation and the average number of words, we use a decision tree to predict this instructor-assigned grade. From the collected data set, we use 2/3 of the data as a training set and 1/3 as a validation set. The decision tree is trained to predict the instructor-assigned grade.

Step 4. Using the weighted standard deviation and the average number of words, the k-NN algorithm is also used to predict this instructor-assigned grade. We set k to 10.

Step 5. Calculate the reputation of the reviewer using Lauw's reputation model. This reputation score is then converted to a predicted instructor-assigned grade.

Step 6. Compare the performance of the predictions. RMSE and Score Diff are used to measure the effectiveness of the approaches.

We then also extend these models using the different sets of metrics for the decision tree and k-NN algorithms described in Table 2.
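The numbers in Steps 1, 2, and 5 can be recomputed as a quick check. Note that the paper does not specify how a Lauw reputation score in [0, 1] is converted to a predicted grade in Step 5, so the linear mapping below is a labeled assumption.

```python
import statistics

scores = [5, 4, 2]                       # Step 1: Alice's scores from Table 1
word_counts = [15, 9, 9]                 # Step 2: words per textual comment

weight = len(scores)
weighted_sd = weight * statistics.stdev(scores)
avg_words = sum(word_counts) / len(word_counts)
print(round(weighted_sd, 1), avg_words)  # ~4.6 (reported as 4.5) and 11.0

def reputation_to_grade(rep, max_grade=100):
    """Hypothetical linear mapping of a [0, 1] reputation score to a grade;
    the paper does not state the actual conversion."""
    return rep * max_grade

print(reputation_to_grade(0.95))         # 95.0
```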
4. RESULTS
In this section, we investigate the factors that influence instructor-assigned grades. Figure 1 shows the distribution of instructor-assigned grades. As shown in Figure 1, the grades are well distributed above 75, with some low scores.

Figure 1. Instructor-assigned grade distribution, where the X-axis represents scores and the Y-axis represents the number of review assignments.

4.1 Effect of Reviewer Assigning Different Scores
We describe hypothesis 1 to answer RQ1.

Hypothesis 1: There is a strong correlation between a reviewer assigning different scores to different rubric items and the instructor-assigned grade. That is, a reviewer who carefully considers what score a student should receive for each rubric item (and therefore gives different scores for different rubric items) is likely to be assigned a higher grade by the instructor than a student who tends to assign the same score (e.g., 4 out of 5) to all or almost all rubric items.

The purpose of this analysis is to investigate the effect of a reviewer assigning different scores on grade prediction. This research question investigates whether reviewers who write high-quality reviews show a correlation between the scores assigned to different rubric items and their grades.

The first step is to find and collect the review scores within the same assignment for each student. The second step is to calculate the weighted standard deviation from the list of scores. The third step is to calculate the relationship between this deviation and the instructor-assigned grades. The assumption is that a student who gives different scores is a more careful reviewer and may therefore receive a higher grade for their review assignments.

For the Pearson correlation, a good fit is useful for predicting an anticipated future rate. We assess statistical significance using standard statistical tests; in this context, we measure the p-value of the correlation model. The p-value is considered an estimate of the goodness of fit of the model, and the test is typically considered statistically significant if the p-value is < 0.05. We used SAS software to conduct this analysis.

A Pearson product-moment correlation coefficient was computed to assess the relationship between the deviation metric and the grades. There was a positive but weak linear correlation between the two variables, r = 0.1, p = 0.03, where r is the correlation coefficient and p is the p-value. As r is small, we observe that there is only a positive but weak correlation between a reviewer assigning different scores to different rubric items and the instructor-assigned grade. Note that this captures only linear correlation.

In addition, as shown in Section 4.2, we still use this metric — a reviewer assigning different scores to different rubric items — for building models, because decision tree models with this metric are effective for grade prediction.

With regard to the Pearson product-moment correlation, we conclude that the data does not support hypothesis 1.
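A minimal sketch of the RQ1 correlation test, assuming synthetic placeholder data: the study computed r and the p-value in SAS, but scipy's pearsonr reports the same two quantities.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
spread = rng.uniform(0, 6, 300)                      # weighted std. dev. per reviewer (toy data)
grades = 80 + 1.0 * spread + rng.normal(0, 10, 300)  # weakly related toy grades

r, p = pearsonr(spread, grades)
print(f"r = {r:.2f}, p = {p:.3f}")   # a small positive r with p < 0.05 would
                                     # mirror the reported r = 0.1, p = 0.03
```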
4.2 Prediction of Instructor-Assigned Grades for Reviewing
We describe hypothesis 2 to answer RQ2.

Hypothesis 2: Our decision tree and k-nearest neighbor models based on the reviewer assigning different scores are effective for predicting instructor-assigned grades. That is, the decision tree and k-nearest neighbor models have smaller RMSEs than Lauw's reputation model.

The purpose of this analysis is to investigate whether models based on a reviewer assigning different scores are effective for predicting instructor-assigned grades.

The first step is to apply the decision tree model to partition the data for the best performance. We divide the data into a training set and a validation set: 2/3 of the data is assigned to the training set for modelling, and the remaining 1/3 is used as the validation set for comparison. The second step is to calculate the average difference and the RMSE between the predicted grades and the actual instructor-assigned grades for the decision tree model. The third step is to compare these results with the results obtained from the k-nearest neighbor model, the baseline, and the prediction model based on the reputation system, Lauw's algorithm [11].

When we use the decision tree and k-nearest neighbor models, we employ two inputs: the weighted standard deviation and the average number of words of the reviews given by a student within one assignment. The output is the predicted grade. To compare performance, we measure the average of the absolute score difference between each actual grade and the corresponding predicted grade; RMSE is also used to measure how close the predicted values are to the actual values. Note that, as grades vary from low to high, grade prediction cannot be achieved with high accuracy; instead, we measure the average difference between the predicted grades and the actual instructor-assigned grades.

All available valid peer-review records are used in this experiment. We report the average absolute score difference and the root mean square error (RMSE) in Tables 3 and 4, which present the results of the decision tree (DT) model, the k-nearest neighbor (k-NN) model, the baseline, and Lauw's reputation model for different input data sets. For example, Base+Text means that the Base and Text input data sets of Table 2 are used.

Table 3. Experimental results for writing assignments based on our decision tree (DT) model, k-nearest neighbor (k-NN) model, baseline, and Lauw's reputation model. The decision tree has a lower RMSE than the baseline and Lauw's reputation model, and its RMSE decreases each time the decision tree is refined. The k-nearest neighbor model has a lower RMSE than Lauw's reputation model, and its RMSE stays similar each time it is refined.

    Model                  Avg. Abs. Score Diff   RMSE
    DT, Base               9.4                    13.0
    DT, Base+Text          8.8                    12.6
    DT, Base+Rep           8.7                    11.4
    DT, Base+Text+Rep      8.0                    10.1
    k-NN, Base             8.8                    12.6
    k-NN, Base+Text        9.1                    13.6
    k-NN, Base+Rep         8.6                    12.2
    k-NN, Base+Text+Rep    8.9                    12.9
    Baseline               9.9                    13.2
    Lauw's Rep Model       16.2                   20.8
    Avg. RMSE: DT 11.7, k-NN 14.4, Baseline 13.2, Lauw's Rep Model 20.8

Table 4. Experimental results for programming assignments based on the decision tree (DT) model, k-nearest neighbor (k-NN) model, baseline, and Lauw's reputation model. The decision tree has a lower RMSE than the baseline and Lauw's reputation model, and its RMSE stays similar each time the decision tree is refined, except for the Base+Text+Rep case. The k-nearest neighbor model has a lower RMSE than Lauw's reputation model, and its RMSE tends to increase each time it is refined.

    Model                  Avg. Abs. Score Diff   RMSE
    DT, Base               8.7                    11.9
    DT, Base+Text          8.4                    11.0
    DT, Base+Rep           8.6                    11.7
    DT, Base+Text+Rep      9.9                    13.0
    k-NN, Base             8.4                    9.0
    k-NN, Base+Text        8.5                    10.8
    k-NN, Base+Rep         8.0                    12.3
    k-NN, Base+Text+Rep    9.0                    11.2
    Baseline               8.9                    13.1
    Lauw's Rep Model       16.2                   20.8
    Avg. RMSE: DT 11.9, k-NN 10.8, Baseline 13.1, Lauw's Rep Model 20.8

Here we consider only the Base input data set, since this research question is related only to the Base metrics. For the Base inputs, we observe that the decision tree and k-nearest neighbor models have smaller RMSEs than the baseline and Lauw's reputation model for both writing and programming assignments. Therefore, the decision tree and k-nearest neighbor models are more effective for prediction in this case; the DT and k-NN models are data-driven models, which assess the input data and find the best fit between these inputs and the output. We also observe that Lauw's reputation model is dependent on the data set. For example, the range [0, 1] is useful for a reputation score; however, someone who receives a reputation score of 0 might be expected to receive the lowest grade (e.g., 0), which does not happen in practice, because the instructor considers many aspects other than reputation.

We conclude that the data supports hypothesis 2.
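The two effectiveness measures reported in Tables 3 and 4 (average absolute score difference and RMSE) can be written out directly; the actual and predicted grades below are placeholders.

```python
import numpy as np

def avg_abs_score_diff(actual, predicted):
    """Average of the absolute gap between actual and predicted grades."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

def rmse(actual, predicted):
    """Square root of the mean of the squared deviations."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

actual = [95, 88, 76, 90, 82]        # placeholder instructor-assigned grades
predicted = [90, 85, 80, 92, 70]     # placeholder model predictions
print(avg_abs_score_diff(actual, predicted), round(rmse(actual, predicted), 1))  # 5.2 6.3
```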
4.3 Prediction of Instructor-Assigned Grades for Reviewing using Text Metrics
We describe hypothesis 3 to answer RQ3.

Hypothesis 3: Our decision tree and k-nearest neighbor models based on additional text metrics are more effective for predicting instructor-assigned grades than the preceding models (based on the reviewer assigning different scores). That is, the decision tree and k-nearest neighbor models based on additional text metrics have smaller RMSEs.

We investigate whether our models with additional text metrics derived from the textual feedback give more effective results for predicting instructor-assigned grades than the preceding models. In this study, we measure three text metrics from the textual feedback: content type, tone, and volume.

The purpose of this analysis is to investigate whether the additional text metrics are useful predictive metrics for improving the decision tree prediction results for instructor-assigned grades. The first step is to measure the text metrics from the reviews. We create our models to partition the data for the best performance; for these models, we use the weighted standard deviation, the average number of words, content type, tone, and volume as inputs, and we divide the data into training and validation sets. The second step is to calculate the average difference between the predicted grades and the actual instructor-assigned grades for our models. The third step is to compare the results with those of the models in Section 4.2.

Tables 3 and 4 show the results of our models with text metrics for grade prediction. All available valid peer-review records are used in this experiment. We report the average score difference and the root mean square error (RMSE) in Tables 3 and 4. When we compare the RMSE results between the Base and Base+Text cases, we see that for the decision tree model the additional text metrics help improve the prediction of grades, whereas for the k-nearest neighbor model the additional text metrics do not help improve the prediction of grades.

The k-nearest neighbor model is based on analogy, which is not effective for prediction in this case. Volume is already accounted for in the number of words; therefore, volume may not yield a substantial improvement. The review content types and tones generated by our meta-review service are not highly analogous for similar grades: some students have higher review grades with negative tones and summative content, while other students have higher review grades with positive tones and problem-detection content, and the k-nearest neighbor model cannot distinguish these cases. Additionally, k-nearest neighbor models that incorporate more, yet unrelated, variables may be less effective than those limited to selected, related variables.

Our results depend on which model is used. We conclude that the data analyzed with the decision tree models supports hypothesis 3, and that the data analyzed with the k-nearest neighbor models does not support hypothesis 3.
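One plausible way to append the Text metrics to the Base inputs before refitting the models is sketched below; the encoding (one-hot columns for content type and tone, the raw volume count) is an assumption, since the paper does not state how JMP consumed these categorical metrics.

```python
import pandas as pd

base = pd.DataFrame({
    "weighted_std": [4.5, 1.2, 0.0],     # Base inputs (toy values)
    "avg_words":    [11, 25, 6],
})
text = pd.DataFrame({
    "content_type": ["advisory", "summative", "problem-detection"],
    "tone":         ["neutral", "positive", "negative"],
    "volume":       [9, 22, 5],
})

# Assumed encoding: one-hot the categorical Text metrics and keep volume as-is.
features = pd.concat(
    [base, pd.get_dummies(text[["content_type", "tone"]]), text[["volume"]]],
    axis=1,
)
print(features.columns.tolist())   # Base + Text columns that would be fed to DT / k-NN
```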
4.4 Prediction of Instructor-Assigned Grades for Reviewing using Text Metrics and Reputation Models
We describe hypothesis 4 to answer RQ4.

Hypothesis 4: Our decision tree and k-nearest neighbor models based on additional reputation scores improve the prediction of instructor-assigned grades. That is, the decision tree and k-nearest neighbor models based on additional reputation scores have smaller RMSEs.

We investigate whether our models with additional text metrics and reputation-model scores show positive results for predicting instructor-assigned grades. As before, we measure the text metrics (content type, tone, and volume) from the textual feedback.

The purpose of this analysis is to investigate whether the additional reputation scores are useful predictive variables for improving the decision tree prediction results for instructor-assigned grades. The first step is to calculate the reputation scores [11] from the reviews. We create our models to partition the data for the best performance. The second step is to calculate the average difference between the predicted grades and the actual instructor-assigned grades for our models. The third step is to compare the results with those of the models in the preceding section.

Tables 3 and 4 show the results of grade prediction. All available valid peer-review records are used in this experiment. We report the average absolute score difference and the root mean square error (RMSE) in Tables 3 and 4. For writing assignments, the decision tree model with the Base+Text+Rep inputs is the most effective in terms of RMSE; we infer that the reputation score helps improve grade prediction in this case. However, for programming assignments, the decision tree model with the Base+Text+Rep inputs is not the most effective in terms of RMSE. The reason may be that, for programming assignments, the focus of reviewing is to check the correctness of the program's behavior against the requirements, with shorter textual feedback than for writing assignments.

Our results depend on which assignments are used. We conclude that our data partially supports hypothesis 4: the decision tree models for writing assignments support hypothesis 4, but the decision tree models for programming assignments do not.

5. CONCLUSIONS AND FUTURE WORK
Peer review is an effective and useful method for improving students' learning by reviewing peer students' work. The quality of peer reviews is important when guiding students. To improve the quality of peer reviews, instructors grade the reviews based on the students' scores and feedback; however, this process is manual, and automated decisions would be helpful. Prediction of instructor-assigned grades is a complex and challenging problem in peer-review systems. We used machine learning algorithms to build models for predicting grades for reviewing. Experimental results showed that the decision tree and k-nearest neighbor (k-NN) models are more effective than Lauw's reputation model in terms of RMSE. We also compared the average RMSE values of the decision tree and k-NN models: the decision tree models (avg. RMSE: 11.7) are more effective than the k-NN models (avg. RMSE: 14.4) for writing assignments, while the k-NN models (avg. RMSE: 10.8) are slightly more effective than the decision tree models (avg. RMSE: 11.9) for programming assignments. Text metrics may be useful for classifying content, but showed less effect on grade prediction. Future work includes the following. First, we will improve the prediction capabilities of the model by investigating other metrics that capture additional features of the data. Second, we will explore the semantics of the text, which may also help guide modelling toward higher performance.
6. ACKNOWLEDGMENTS
This study is partially funded by the PeerLogic project under National Science Foundation grants 1432347, 1431856, 1432580, 1432690, and 1431975.

7. REFERENCES
[1] Topping, K. "Peer assessment between students in colleges and universities." Review of Educational Research 68.3 (1998): 249-276.
[2] Coursera: https://www.coursera.org/, 2016.
[3] Gehringer, E. "Expertiza: information management for collaborative learning." Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-Learning Support, pp. 143-159, 2009.
[4] Kulkarni, C., Koh, P. W., Le, H., Chia, D., Papadopoulos, K., Cheng, J., Koller, D., and Klemmer, S. R. "Peer and self assessment in massive online classes." In Design Thinking Research, pp. 131-168. Springer International Publishing, 2015.
[5] Ramachandran, L. and Gehringer, E. "Automated assessment of review quality using latent semantic analysis." 11th IEEE International Conference on Advanced Learning Technologies, 2011.
[6] Ramachandran, L. and Gehringer, E. "An automated approach to assessing the quality of code reviews." American Society for Engineering Education, San Antonio, TX, 2012.
[7] Margerum, L., Gulsrud, M., and Manlapez, R. "Application of calibrated peer review (CPR) writing assignments to enhance experiments with an environmental chemistry focus." J. Chemical Education 84, no. 2 (2007): 292.
[8] de Alfaro, L. and Shavlovsky, M. "CrowdGrader: a tool for crowdsourcing the evaluation of homework assignments." Proc. 45th ACM Technical Symposium on Computer Science Education (SIGCSE '14), ACM, pp. 415-420, 2014.
[9] Jonsson, A. and Svingby, G. "The use of scoring rubrics: reliability, validity and educational consequences." Educational Research Review, v2 n2, pp. 130-144, 2007.
[10] Song, Y., Hu, Z., and Gehringer, E. F. "Closing the circle: use of students' responses for peer-assessment rubric improvement." Proc. Advances in Web-Based Learning -- ICWL 2015, pp. 27-36, 2015.
[11] Song, Y., Hu, Z., and Gehringer, E. F. "Pluggable reputation systems for peer review: a web-service approach." FIE, pp. 1-5, 2015.
[12] JMP Decision Tree Model. https://www.jmp.com/support/downloads/pdf/jmp11/Specialized_Models.pdf
[13] Phyu, T. N. "Survey of classification techniques in data mining." In Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, pp. 18-20, 2009.
8. APPENDIX

Appendix A. Examples of Rubric Criteria of Writing Assignments in CSC 517

A1. Organization: how logical and clear is the organization?
    Score range: (terrible organization) 0 to 5 (very logical and clear)
A2. Clarity: Are the sentences clear and non-duplicative? Is the language used in this artifact simple and basic enough to be understood?
    Score range: (terrible English usage) 0 to 5 (good English usage)
A3. Did the authors revise their work in accordance with your suggestions?
    Score range: (not agree) 0 to 5 (strongly agree)
A4. Originality: If you found any plagiarism in round 1, has it been removed? Then, randomly pick some sentences or paragraphs and search for them with a search engine. Describe any text that may infringe copyrights.
    Score range: (several places of plagiarism spotted) 0 to 5 (no plagiarism spotted)
A5. Coverage: does the artifact cover all the important aspects that readers need to know about this topic? Are all the aspects discussed at about the same level of detail?
    Score range: (not agree) 0 to 5 (strongly agree)
A6. Definitions: are the definitions of unfamiliar terms clear and concise? Are the definitions adequately supported by explanations or examples?
    Score range: (several definitions are missing or incomplete) 0 to 5 (strongly agree)
A7. References: do the major concepts have citations to more detailed treatments? Are there any unavailable links?
    Score range: (many more references should be added) 0 to 5 (strongly agree)
A8. List the unfamiliar terms used in this wiki. Are those unfamiliar terms well defined or linked to proper references?
    Score range: (neither defined nor linked) 1 to 5 (well defined or links are added)
A9. Rate the overall readability of the article. Explain why you give this score.
    Score range: (not readable and confusing) 1 to 5 (readable and not confusing)
A10. Rate the English usage. Give a list of spelling, grammar, punctuation, or language-usage mistakes you can find in this wiki (e.g., ruby on rails -> Ruby on Rails).
    Score range: (terrible English usage) 1 to 5 (good English usage)
A11. List any related terms or concepts for which the writer failed to give adequate citations and links. Rate the helpfulness of the citations.
    Score range: (more citations are needed) 1 to 5 (adequate citations)
A12. Rate how logical and clear the organization is. Point out any places where you think that the organization of this article needs to be improved.
    Score range: (terrible organization) 1 to 5 (very logical and clear)

Appendix B. Snapshot of Decision Tree Model for Writing Assignments for Base+Text+Rep Metrics