Prediction of Grades for Reviewing with Automated Peer-review and Reputation Metrics

Da Young Lee, Ferry Pramudianto, Edward F. Gehringer
North Carolina State University
Raleigh, NC 27695
[dlee10, fferry, efg]@ncsu.edu


ABSTRACT
Peer review is an effective and useful method for improving students' learning through review by student peers, and it has been used in classes for several decades. To ensure its success, research challenges such as the quality of peer reviews must be addressed; in particular, it is challenging to identify how good a reviewer is. We develop a prediction model to assess students' reviewing capability. We investigate several important factors that influence reviewing capability, which corresponds to the instructor-assigned grade for reviewing. We use machine learning algorithms to build models for predicting grades for reviewing. Our models are based on several metrics, such as the spread of the scores a reviewer assigns to different rubric items and automated metrics that assess the textual feedback given by the reviewers. To improve the models, we also use reputation scores of students as reviewers. We present experimental results that show the effectiveness of the models.

Keywords
Peer reviews, rubrics, prediction model

1. INTRODUCTION
Peer review [1, 4] is an effective and useful method for improving students' learning by reviewing peer students' work. Peer review has been used in classes for several decades. In recent years, it has been used not only in traditional classes but also in online courses such as Massive Open Online Courses (MOOCs) [4]. For example, Coursera [2] offers many online courses in which thousands of students from around the world are enrolled. In such cases, instructors are not able to give feedback to such a large number of students in a timely manner. Therefore, the development of peer-review methods based on observing peer behaviors is important, and the technology should be made more reliable and useful to users.

The classroom peer-review process is as follows. Students submit their assignments, and reviewers (peer students) provide reviews of the assignments. The students then have a chance to improve their submitted work by incorporating the scores and comments in the reviews. Because reviewers in education are peer students, they may lack sufficient reviewing experience. Therefore, they need to be guided through the peer-review process to ensure that high-quality reviews are provided.

The assessment of reviews is a challenging problem in education. Meta-reviewing is a manual process [5] in which an instructor may assign grades and provide feedback as a measure of a student's reviewing capability. The problem is that this manual meta-reviewing process is tedious and time-consuming.

To address this issue, this study investigates methods to help identify good reviewers who write high-quality reviews. To attain this goal, we examine factors that may influence review scores and, using machine learning algorithms, propose a model that predicts how good reviewers are based on the reviews they have written.

We investigate several important factors that influence instructor-assigned grades, especially the reviewers' score-assigning behavior. In this paper, we refer to the instructor-assigned grade (i.e., the grade) as the student's reviewing-capability score assigned by the instructor. Another factor is automated peer-review metrics, which are text metrics [5, 6], such as tone, for assessing the textual feedback given by the reviewers. The other factor is a reputation metric [11] that determines who is a good reviewer based on historical review scores across artifacts. This reputation metric is calculated from a measure of the reviewer's leniency ("bias").

In this paper, we first investigate whether there is a strong or weak correlation between reviewers with high reviewing capability and the spread between their scores. Note that the spread between scores corresponds to the deviation of a reviewer's scores described in Section 3.3. We then investigate whether a model based on the reviewer assigning different scores would be effective for predicting how good the reviewer is. For this task, we apply machine learning techniques such as decision tree [12] and k-Nearest Neighbors [13] algorithms to build a prediction model. We then investigate whether this model, when it incorporates textual feedback, shows positive results for predicting how good the reviewer is. Lastly, we investigate whether our model combined with text metrics and reputation scores shows positive results for predicting how good the reviewer is. For these tasks, we investigate the following research questions:

   RQ1: Is there a correlation between a reviewer assigning different scores (i.e., the "spread" between scores) to different rubric items and instructor-assigned grades?

   RQ2: How well does our model, based on the reviewer assigning different scores, predict instructor-assigned grades?

   RQ3: How well does our model combined with text metrics of reviews predict instructor-assigned grades?

   RQ4: How well does our model combined with text metrics and reputation scores of reviewers predict instructor-assigned grades?

The rest of the paper is organized as follows. In Section 2, we briefly introduce the peer-review process and the peer-review system Expertiza [3]. In Section 3, we describe our methodology for the study. In Section 4, we present our experimental results. Finally, we give concluding remarks in Section 5.
2. BACKGROUND
This section discusses background for this study.

2.1 Peer Review System: Expertiza
There are many tools that support the peer-review process [3, 7, 8]. Expertiza is a web-based education system into which a feature for peer review is integrated. This feature is part of an active learning process for peer students.

Using Expertiza, students in a class are able to select tasks from an assignment list. After students complete their tasks, they submit their output to receive reviews from peers in the peer-review system. The submissions are reviewed by anonymous peers, who can provide helpful comments and give scores based on rubrics. Researchers have worked on peer-review systems for decades and have improved Expertiza as an effective learning management and peer-review system.

Students expect to receive author feedback. Typically, a double-blind review process makes it difficult for students to explain what they have done, especially when reviewers misunderstand the contents of a submission and give low grades. In Expertiza, peer review may have multiple rounds, in which reviewers give feedback for improvement and check whether their suggestions have been implemented in the next round. Each round has several deadlines, which are useful for organizing reviewing and resubmission.

In Expertiza, functionality for supporting wikis is integrated for collaboration among students. For submissions, students may use a wiki, which is very helpful for supporting collaborative work on writing assignments. These wikis provide several features for easy editing and for keeping track of past revisions.

2.2 Peer Review
Each student can select more than one submission to review within one assignment period. Each review consists of a review rubric that guides students in completing the review. Each rubric may include multiple questions, called criteria. Appendix A is an example of a rubric, which consists of 12 rubric criteria. For example, a question may ask for an assessment of the organization, originality, grammar, or clarity of the writing submission under review. The rubric also asks whether the definitions, examples, and links found in the submission are of acceptable quality.

In the peer-review process, reviewers often provide two kinds of feedback: quantitative (scores) and qualitative feedback. Reviewers give numeric scores for certain rubric criteria. In other words, after the reviewers read the rubric, they submit textual feedback and a numeric-scale score for each criterion.

For example, a rubric criterion can be, "on a scale of 1 (worst) to 5 (best), how easy is it to understand the code?" Moreover, reviewers are often required to provide formative textual feedback in which their comments incorporate identified issues, suggestions, and other remarks. Numeric scores are helpful, but textual feedback gives more concrete ideas about the submission.

3. METHODOLOGY
This section discusses the methodology for this study.

3.1 Data

3.1.1 Data Collection
We assemble peer-review data from Expertiza [3]. This tool is a web-based educational learning application that helps students review peers' work. We analyze 703 records submitted by students who were assigned to grade the assignments of their peers.

The data set is collected from two graduate-level courses: CSC 517 (Object-Oriented Design and Development) and CSC 506 (Architecture of Parallel Computers), both offered at North Carolina State University. For example, in CSC 517, programming assignments and writing assignments are used for peer reviews. These assignments are team-based, with more than two students collaborating. We use six review assignments, four of which are writing assignments and two of which are programming assignments.

In this study, instructors manually assess the submitted reviews and assign scores within one review period, in which each student may review multiple submissions. A final grade is given based on the students' submissions and the quality of their reviews of their peers' submissions.

3.1.2 Data Preparation
A data cleaning process is required before the analysis; it includes combining multiple database and Excel tables based on the user's id using SAS. During this process, we remove entries whose numeric scores are 0 or NULL, which indicate an empty review. Invalid numeric scores can occur when students dropped their courses and did not assign scores to the submissions of peer students. In addition, a rubric may require only textual feedback; such rubric items are not included in this study.
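The cleaning rule above can be expressed compactly. The following is a minimal sketch, assuming the merged review records are available as a single table with a hypothetical numeric column named score; the actual merging was done from database and Excel tables with SAS.

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    records = pd.read_csv("review_records.csv")

    # Drop entries whose numeric score is NULL or 0; both indicate an empty
    # review (e.g., a student who dropped the course and never scored it).
    cleaned = records[records["score"].notna() & (records["score"] != 0)]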
3.2 Research Questions
We investigate several important factors that influence instructor-assigned grades, especially the reviewers' score-assigning behavior. As explained in Section 1, we refer to the instructor-assigned grade (i.e., the grade) as the student's reviewing-capability score assigned by the instructor.

To study the usefulness of review quality assessment, we investigate the following research questions:

   RQ1: Is there a correlation between a reviewer assigning different scores (i.e., the "spread" between scores) to different rubric items and instructor-assigned grades?

   RQ2: How well does our model, based on the reviewer assigning different scores, predict instructor-assigned grades?

   RQ3: How well does our model combined with text metrics of reviews predict instructor-assigned grades?

   RQ4: How well does our model combined with text metrics and reputation scores of reviewers predict instructor-assigned grades?

We now describe the research questions in more detail. For RQ1, reviewers may assign grades for multiple submissions within the same review. This research question investigates whether there is a strong or weak correlation between reviewers with high reviewing capability and the spread between their scores. Note that the spread between scores is measured by the weighted standard deviation described in Section 3.3. RQ2 investigates whether a model based on the reviewer assigning different scores would be effective for
                                                                         predicting how good the reviewer is. RQ3 investigates whether this
model incorporating textual feedback shows positive results for predicting how good the reviewer is. RQ4 investigates whether our model combined with text metrics and reputation scores shows positive results for predicting how good the reviewer is. Note that we use a text analysis tool to automatically extract the text metrics [5, 6]. We measure text metrics of the given textual feedback, such as content type, tone, and volume.

3.3 Metrics
We utilize the following metrics to address the research questions.

 Pearson Correlation Coefficient: The Pearson correlation coefficient measures the simple linear correlation between two sets of data, i.e., the degree to which they are related. The correlation is measured as follows:

   $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

   We measure the correlation between a reviewer assigning different scores to different rubric items and that reviewer being given a high grade by the instructor. The correlation coefficient ranges between -1 and 1, where 1 implies a perfect linear relation between X and Y, -1 implies that Y decreases linearly as X increases, and 0 implies no linear relation.

 Weighted Standard Deviation: This metric is measured as $\hat{w}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - M)^2}$, where the standard deviation measures the spread of the observed numbers (x1, x2, ..., xn) in a data set with mean value M, and $\hat{w}$ is the weight. We use this value to measure the degree of spread of the scores given by a reviewer; $\hat{w}$ is the number of reviews assigned to each reviewer within one assignment.

 Average Number of Words (Avg. # Words): Given more than one review comment, this metric is the average number of words per comment.

We measure the weighted standard deviation and the average number of words, which are used as inputs to the machine learning algorithms for predicting instructor-assigned grades.
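As an illustration of these two base metrics, the sketch below computes them for one reviewer. It treats the weight as the number of scores in the list and uses the sample (n-1) standard deviation, which reproduces the value of about 4.5 used in the worked example of Section 3.4; that convention is an assumption on our part, since the formula above is written with 1/n.

    import statistics

    def weighted_std(scores):
        """Spread of a reviewer's scores, weighted by the number of reviews."""
        return len(scores) * statistics.stdev(scores)  # sample standard deviation

    def avg_num_words(comments):
        """Average number of words over a reviewer's textual comments."""
        return sum(len(c.split()) for c in comments) / len(comments)

    # Scores and comments that Alice gave to Bob's submission (Table 1).
    scores = [5, 4, 2]
    comments = [
        "The organization is good and nicely gives intro, features and then examples of the framework",
        "They are clear and the language is easily understood",
        "No, I don't see any changes from previous version",
    ]
    print(round(weighted_std(scores), 2))  # 4.58, reported as 4.5 in Section 3.4
    print(avg_num_words(comments))         # (15 + 9 + 9) / 3 = 11.0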
 Root Mean Square Error (RMSE): The RMSE between predicted values and actual values is computed as the square root of the mean of the squared deviations.

 Score Difference (Score Diff): This metric is the gap between a predicted value and the corresponding actual value.

RMSE and Score Diff are used to measure the effectiveness of the models: the larger the RMSE and Score Diff, the less effective a prediction model is; the smaller they are, the more effective it is.
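For concreteness, the two evaluation measures can be written as follows; this is a minimal sketch in which predicted and actual are hypothetical lists of grades of equal length.

    import math

    def rmse(predicted, actual):
        """Square root of the mean squared deviation between predictions and actuals."""
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    def avg_abs_score_diff(predicted, actual):
        """Average absolute gap between predicted and actual grades."""
        return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)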
For determining the quality of the textual feedback, we use various text metrics, which can be collected automatically via a text analysis tool [5, 6].

 Content Type: reviews may include different types of content. This metric classifies content into one of three categories: summative, problem-detection, or advisory.
   - Summative content: positive feedback or a summary of the submission. For example, "The page is organized logically" is classified as summative content.
   - Problem-detection content: identifies problems in the submission. For example, "The page lacks a qualitative approach and an overview" is classified as problem-detection content.
   - Advisory content: provides suggestions to the students for improving their work. For example, "The page could contain more ethics related links" is classified as advisory content.

 Tone: reviews may carry different tones, which refer to the semantic orientation of the text, given the words and presentation written by the reviewer. This metric classifies content into one of three tones: positive, negative, or neutral.
   - Positive: a review is classified as having a positive tone when it contains positive feedback overall. For example, positive words or phrases such as "well-organized paper" and "complete" indicate a positive semantic orientation.
   - Negative: a review is classified as having a negative tone when it contains negative feedback overall. For example, negative words or phrases such as "copied", "poor", and "not complete" indicate a negative semantic orientation.
   - Neutral: a review is classified as having a neutral tone when it contains neutral feedback or a mix of positive and negative feedback. For example, "The organization looks good overall; however, we did not understand the terms." indicates a neutral semantic orientation: "looks good" is positive and "did not understand" is negative.

 Volume: reviews may include different numbers of words. This metric refers to the quantity of unique tokens in the review, excluding stop words such as pronouns.

We also use Lauw's reputation score, as identified by Song et al. [11]. The Lauw-peer algorithm is based on a measure of the reviewer's leniency ("bias"), which can be either positive or negative.

 Lauw's Reputation Score: this metric measures who is a good reviewer based on historical review scores across artifacts. The reputation range calculated by the Lauw algorithm is [0, 1]; a reputation score close to 1 means the reviewer is credible.

We measure the text metrics, which are used as additional inputs to the machine learning algorithms for predicting instructor-assigned grades.
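The content-type and tone labels in our data come from the text analysis tool of [5, 6]. Purely to illustrate the volume metric, with a naive keyword heuristic standing in for the tone classifier, one could write a sketch like the following; the stop-word and cue-word lists are toy assumptions, not the tool's actual lexicon.

    STOP_WORDS = {"the", "a", "an", "and", "or", "it", "they", "we", "i", "is", "are"}
    POSITIVE_CUES = {"well-organized", "complete", "good", "clear"}
    NEGATIVE_CUES = {"copied", "poor", "lacks", "not"}

    def tokens(text):
        return {t.strip('.,;:!?"').lower() for t in text.split()}

    def volume(review_text):
        """Number of unique tokens in the review, excluding stop words."""
        return len(tokens(review_text) - STOP_WORDS)

    def naive_tone(review_text):
        """Toy stand-in for the tone classifier: positive, negative, or neutral."""
        t = tokens(review_text)
        pos, neg = len(t & POSITIVE_CUES), len(t & NEGATIVE_CUES)
        if pos and not neg:
            return "positive"
        if neg and not pos:
            return "negative"
        return "neutral"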
3.4 Approach
Machine learning approaches [12, 13] such as k-nearest neighbor, decision tree, and neural network are useful for prediction. For our experiments, we use k-nearest neighbor and decision tree, which are based on supervised learning [13]. We first choose the k-nearest neighbor classifier because it is based on learning by analogy of the input values. With this model, we can observe closeness patterns based on the Euclidean distance between the training and validation data sets. Because that model is rooted in analogy, we also use another approach that is known to perform well when classifying categorical values: the decision tree model is a good fit for our input data set with its different kinds of values. In addition, this approach is comprehensible, since it produces a tree structure. In this paper, we did not consider other machine learning algorithms such as SVM and Naïve Bayes, because they require complicated calculations and it is not easy to track how values are classified.
Table 1. Examples of writing-assignment reviews for a submission, with the numeric score and textual feedback given by the reviewer. The detailed questions for the rubric criteria are found in Appendix A.

Rubric Criteria     Score   Textual Feedback
A1. Organization    5       The organization is good and nicely gives intro, features and then examples of the framework
A2. Clarity         4       They are clear and the language is easily understood
A3. Revision        2       No, I don't see any changes from previous version

Table 2. Input data sets used for our decision tree and k-Nearest Neighbors algorithms for predicting instructor-assigned grades for reviewing.

Name    Input Data Set
Base    Weighted Standard Deviation; Average Number of Words
Text    Content Type; Tone; Volume
Rep     Lauw's Reputation Score

We first propose a simple baseline model, which predicts the instructor-assigned grade as the average grade for reviewing in the training data set. This baseline model is used to judge whether machine learning algorithms are useful for grade prediction: models based on machine learning algorithms should show better prediction than the simple baseline.

As Expertiza has been used for several years, we have sufficient data concerning students' reviews and instructor grade assignment. We divide our data into a training set and a validation set. Our goal in using the machine learning approach is to predict instructor-assigned grades. This problem is related to the reputation of peers: the peers who carefully review other students' submissions are likely to receive higher scores. We first use the training data set to train our decision tree model and then apply the model to the validation set for score prediction.

In this paper, we use a decision tree [12] to explore and model our data. Decision trees are typically used in operations research, especially for decision analysis. As we would like to predict specific decisions related to scores, decision models are applicable. We used a SAS tool, called JMP [12], in which machine learning approaches are integrated. In this tool, the partition function recursively partitions data according to the relationship between data sets. Given the relationship between the X and Y values, a tree of partitions is generated. In this process, the tool automatically searches for groups and continues splitting off separate groups. This is done recursively until the tool reaches a specified desired fit; the tool stops when the prediction result no longer improves.

We also consider the k-Nearest Neighbors algorithm (k-NN for short). The k-NN algorithm is widely used among machine learning algorithms. k-NN is a non-parametric method used for classification and is supported by the SAS/JMP tool. The input consists of the k closest training examples, which contain similar features. k-NN is a type of learning that finds an approximated, locally similar classification.

We describe the steps of our work using the example reviews in Table 1. We compare the automated metrics with the reputation score assessed by a reputation algorithm, Lauw's model, for predicting the instructor-assigned grades for reviewing. We also combine the two approaches to see whether the combined models show better prediction.

Consider that Alice gives scores and comments to Bob. Note that rubric criteria A1, A2, and A3 are found in Appendix A.

Step 1. Collect the list of scores that students gave to assignments. Suppose that Alice gave scores and comments to Bob's assignment. As shown in Table 1, Alice rated Bob's assignment 5, 4, and 2. These scores are used for calculating the weighted standard deviation, which is 4.5, given that the standard deviation is 1.5 and the weight is 3.

Step 2. Collect the list of comments given by a reviewer. The numbers of words in the three pieces of textual feedback are 15, 9, and 9; the total count is 33 and the average number of words is 11.

Step 3. Suppose that Alice's reviewing score assigned by the instructor is 95. Using the weighted standard deviation and the average number of words, we use a decision tree to predict this instructor-assigned grade. From the collected data set, we use 2/3 of the data as a training set and 1/3 as a validation set. The decision tree is trained to predict the instructor-assigned grade.

Step 4. Using the weighted standard deviation and the average number of words, the k-NN algorithm is used to predict this instructor-assigned grade. We set k to 10.

Step 5. Calculate the reputation of a reviewer using Lauw's reputation model. This reputation score is then converted to a predicted instructor-assigned grade.

Step 6. Compare prediction performance. RMSE and Score Diff are used to measure the effectiveness of the approaches.
We then also extend these models using different sets of metrics for the decision tree and k-NN algorithms, as described in Table 2.
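The experiments themselves were run in SAS/JMP; the sketch below only mirrors the overall protocol with scikit-learn on synthetic stand-in data (a 2/3-1/3 split, k = 10, and a baseline that predicts the mean training grade), purely to make the pipeline concrete. The feature values and grades here are hypothetical.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    # Stand-in data: [weighted std dev, avg. number of words] per reviewer,
    # and the instructor-assigned grade for reviewing (0-100).
    rng = np.random.default_rng(0)
    X = rng.uniform([0, 0], [10, 60], size=(300, 2))
    y = np.clip(60 + 3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 8, 300), 0, 100)

    # 2/3 of the data for training, 1/3 for validation.
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=1/3, random_state=0)

    models = {
        "decision tree": DecisionTreeRegressor(random_state=0),
        "k-NN (k=10)": KNeighborsRegressor(n_neighbors=10),
    }
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_va)
        print(name, "RMSE:", mean_squared_error(y_va, pred) ** 0.5)

    # Baseline: predict the mean training grade for every validation reviewer.
    baseline = np.full_like(y_va, y_tr.mean())
    print("baseline RMSE:", mean_squared_error(y_va, baseline) ** 0.5)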
4. RESULTS
In this section, we investigate the factors that influence instructor-assigned grades. Figure 1 shows the instructor-assigned grade distribution. As shown in Figure 1, most grades fall above 75, with a few low scores.

Figure 1. Instructor-assigned grade distribution, where the X-axis represents scores and the Y-axis represents the number of review assignments.

4.1 Effect of Reviewer Assigning Different Scores
We describe hypothesis 1 to answer RQ1.

Hypothesis 1: There is a strong correlation between a reviewer assigning different scores to different rubric items and instructor-assigned grades. That is, a reviewer who carefully considers what score a student should receive for each rubric item (and therefore gives different scores for different rubric items) is likely to be assigned a higher grade by the instructor than a student who tends to assign the same score (e.g., 4 out of 5) to all or almost all rubric items.

The purpose of this study is to investigate the effect of the reviewer assigning different scores on grade prediction. This research question investigates whether reviewers with high-quality reviews show some correlation between the scores assigned to different rubric items and their grades.

The first step is to find and collect the review scores within the same assignment for each student. The second step is to calculate the weighted standard deviation from the list of scores. The third step is to calculate the relationship between this deviation and instructor-assigned grades. The assumption is that a student who gives scores differently is a more careful reviewer, and that this student may receive a higher grade for their review assignments.

For the Pearson correlation, a good fit is useful for predicting an anticipated future rate. We assess the statistical significance using statistical testing methods. In this context, we measure the p-value of the correlation models. The p-value represents the probability of satisfying the model and is considered an estimate of its goodness of fit. Typically, the test is statistically significant if the p-value < 0.05. We used SAS software for this analysis.

A Pearson product-moment correlation coefficient was computed to assess the relationship between the deviation metric and grades. There was a positive but weak linear correlation between the two variables, r = 0.1, p = 0.03, where r is the correlation coefficient and p is the p-value. As r is small, we observe only a positive but weak correlation between a reviewer assigning different scores to different rubric items and instructor-assigned grades. Note that this shows only linear correlation.

In addition, as shown in Section 4.2, we still use this metric, the reviewer assigning different scores to different rubric items, for building models, because decision tree models with this metric are effective for grade prediction.

With regard to the Pearson product-moment correlation, we conclude that the data do not support hypothesis 1.
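The analysis was run in SAS; the same check can be reproduced with SciPy, as in the minimal sketch below, where the per-reviewer values are hypothetical placeholders.

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical per-reviewer values: weighted standard deviation of the
    # scores each reviewer assigned, and the instructor-assigned grade.
    wstd   = np.array([4.5, 0.0, 2.1, 3.3, 1.0, 5.2, 0.5, 2.8])
    grades = np.array([95, 78, 88, 90, 80, 93, 75, 85])

    r, p = pearsonr(wstd, grades)
    print(f"r = {r:.2f}, p = {p:.3f}")  # the study reports r = 0.1, p = 0.03 on its data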
4.2 Prediction of Instructor-Assigned Grades for Reviewing
We describe hypothesis 2 to answer RQ2.

Hypothesis 2: Our decision tree and k-nearest neighbor models based on the reviewer assigning different scores are effective for predicting instructor-assigned grades. That is, the decision tree and k-nearest neighbor models have smaller RMSEs than Lauw's reputation model.

The purpose of this study is to investigate whether models based on a reviewer assigning different scores are effective for predicting instructor-assigned grades.

The first step is to apply the decision tree model to partition the data for the best performance. We divide the data into a training set and a validation set: 2/3 of the data is used as the training set for modelling, and the remaining 1/3 is used as the validation set for comparison. The second step is to calculate the average difference and the RMSE between the predicted grades and the actual instructor-assigned grades for the decision tree model. The third step is to compare the results with those obtained from the k-nearest neighbor model, the baseline, and the prediction model based on a reputation system, Lauw's algorithm [11].

When we use the decision tree and k-nearest neighbor models, we employ two inputs: the weighted standard deviation and the average number of words of the reviews given by a student within one assignment. The output is the predicted grade. To compare performance, we measure the average of the absolute score difference between each actual grade and the corresponding predicted grade. RMSE is also used to measure how close the predicted values are to the actual values. Note that, as grades vary from low to high, grade prediction cannot be achieved with high accuracy; instead, we measure the average difference between predicted grades and actual instructor-assigned grades.

All available valid peer-review records are used in this experiment. We report the average absolute score difference and the root mean square error (RMSE) in Tables 3 and 4, which present the results of the decision tree (DT) model, the k-nearest neighbor (k-NN) model, the baseline, and Lauw's Rep Model for different input data sets. For example, Base+Text means that the Base and Text input data sets of Table 2 are used.
Table 3. Experimental results for writing assignments based on our decision tree (DT) model, k-nearest neighbor (k-NN) model, baseline, and Lauw's Rep Model. The decision tree has a lower RMSE than the baseline and Lauw's reputation model, and the RMSE decreases each time the decision tree is refined. The k-nearest neighbor model has a lower RMSE than Lauw's reputation model, and its RMSE is similar each time it is refined.

Model  Inputs            Avg. Abs. Score Diff   RMSE
DT     Base               9.4                   13.0
DT     Base+Text          8.8                   12.6
DT     Base+Rep           8.7                   11.4
DT     Base+Text+Rep      8.0                   10.1
k-NN   Base               8.8                   12.6
k-NN   Base+Text          9.1                   13.6
k-NN   Base+Rep           8.6                   12.2
k-NN   Base+Text+Rep      8.9                   12.9
Baseline                  9.9                   13.2
Lauw's Rep Model         16.2                   20.8
Avg. RMSE: DT 11.7, k-NN 14.4, Baseline 13.2, Lauw's Rep Model 20.8.

Table 4. Experimental results for programming assignments based on the decision tree (DT) model, k-nearest neighbor (k-NN) model, baseline, and Lauw's Rep Model. The decision tree has a lower RMSE than the baseline and Lauw's reputation model, and its RMSE is similar each time it is refined, except for Base+Text+Rep. The k-nearest neighbor model has a lower RMSE than Lauw's reputation model, but its RMSE tends to increase each time it is refined.

Model  Inputs            Avg. Abs. Score Diff   RMSE
DT     Base               8.7                   11.9
DT     Base+Text          8.4                   11.0
DT     Base+Rep           8.6                   11.7
DT     Base+Text+Rep      9.9                   13.0
k-NN   Base               8.4                    9.0
k-NN   Base+Text          8.5                   10.8
k-NN   Base+Rep           8.0                   12.3
k-NN   Base+Text+Rep      9.0                   11.2
Baseline                  8.9                   13.1
Lauw's Rep Model         16.2                   20.8
Avg. RMSE: DT 11.9, k-NN 10.8, Baseline 13.1, Lauw's Rep Model 20.8.
We observe the case with only the Base input data set, since this research question concerns only the base metrics. For the Base inputs, we observe that the decision tree and k-nearest neighbor models have smaller RMSEs than the baseline and Lauw's reputation model for both writing and programming assignments. Therefore, the decision tree and k-nearest neighbor models are more effective for prediction in this case. The DT and k-NN models are data-driven models, which assess the input data and find the best fit correlating these inputs with an output. We observe that Lauw's Rep Model is dependent on the data set. For example, the range [0, 1] is useful for a reputation score; however, a student who receives a reputation score of 0 would then be expected to receive the lowest grade (e.g., 0), which cannot happen, because the instructor considers many aspects other than reputation.

We conclude that the data support hypothesis 2.

4.3 Prediction of Instructor-Assigned Grades for Reviewing using Text Metrics
We describe hypothesis 3 to answer RQ3.

Hypothesis 3: Our decision tree and k-nearest neighbor models based on additional text metrics are more effective for predicting instructor-assigned grades than the preceding models (based on the reviewer assigning different scores). That is, the decision tree and k-nearest neighbor models based on additional text metrics have smaller RMSEs.

We investigate whether our models with additional text metrics derived from textual feedback show more effective results for predicting instructor-assigned grades than the preceding models. In this study, we measure three text metrics from the textual feedback: content type, tone, and volume.

The purpose of this study is to investigate whether additional text metrics can be useful as predictive metrics for improving the decision tree prediction results with regard to instructor-assigned grades. The first step is to measure the text metrics from the reviews. We create our models to partition the data for the best performance. For this model, we use the weighted standard deviation, the average number of words, content type, tone, and volume. We divide the data into training and validation sets. The second step is to calculate the average difference between the predicted grades and the actual instructor-assigned grades for our models. The third step is to compare the results with those obtained from our models in Section 4.2.

Tables 3 and 4 show the results of our models with text metrics for grade prediction. All available valid peer-review records are used in this experiment. We report the average score difference and root mean square error (RMSE) in Tables 3 and 4. From these results, when we compare the RMSE between the Base and Base+Text cases, we see that for the decision tree model the additional text metrics help improve the prediction power for grades, whereas for the k-nearest neighbor model they do not.

The k-nearest neighbor model's results are based on analogy, which is not effective for prediction in this case. Volume is already accounted for in the number of words; therefore, volume may not show substantial improvement. The review content types and tones generated by our meta-review service are not highly analogous for similar grades: we observe that some students may have higher review grades with negative tones and summative content, while other students may have higher review grades with positive tones and problem-detection content. The k-nearest neighbor models cannot distinguish these cases. Additionally, k-nearest neighbor models that incorporate more, yet unrelated, variables may be less effective than those limited to selected, related variables.

Our results depend on which model is used. We conclude that the data analyzed by the decision tree models support hypothesis 3, whereas the data analyzed by the k-nearest neighbor models do not support hypothesis 3.

4.4 Prediction of Instructor-Assigned Grades for Reviewing using Text Metrics and Reputation Models
We describe hypothesis 4 to answer RQ4.

Hypothesis 4: Our decision tree and k-nearest neighbor models based on additional reputation scores improve the prediction of instructor-assigned grades. That is, the decision tree and k-nearest neighbor models based on additional reputation scores have smaller RMSEs.

We investigate whether our models with additional text metrics and reputation scores show positive results for predicting instructor-assigned grades. In this study, we measure the text metrics from the textual feedback (content type, tone, and volume) together with the reputation score.

The purpose of this study is to investigate whether additional reputation scores can be useful as predictive variables for improving the decision tree prediction results with regard to instructor-assigned grades. The first step is to calculate the reputation scores [11] from the reviews. We create our models to partition the data for the best performance.
The second step is to calculate the average difference between the predicted grades and the actual instructor-assigned grades for our models. The third step is to compare the results with those obtained from the model in the preceding section.

Tables 3 and 4 show the results of grade prediction. All available valid peer-review records are used in this experiment. We report the average absolute score difference and root mean square error (RMSE) in Tables 3 and 4. From these results, for writing assignments, the decision tree model with the Base+Text+Rep inputs is the most effective in terms of RMSE; we infer that the reputation score helps improve grade-prediction performance in this case. However, for programming assignments, the decision tree model with the Base+Text+Rep inputs is not the most effective in terms of RMSE. The reason may be that, for programming assignments, the focus of reviewing is to check the correctness of program behavior against the requirements, with shorter textual feedback than for writing assignments.

Our results depend on which assignments are used. We conclude that our data partially support hypothesis 4: the decision tree models for writing assignments support hypothesis 4, but the decision tree models for programming assignments do not.

5. CONCLUSIONS AND FUTURE WORK
Peer review is an effective and useful method for improving students' learning by reviewing peer students' work. The quality of peer reviews is important when guiding students. To improve the quality of peer reviews, instructors grade the reviews based on students' scores and feedback. However, this process is manual, and automated decisions would be helpful. Prediction of instructor-assigned grades is a complex and challenging problem in peer-review systems. We used machine learning algorithms to build models for predicting grades for reviewing. Experimental results showed that the decision tree model and the k-nearest neighbor (k-NN) model are more effective than Lauw's reputation model in terms of RMSE. We also compared the average RMSE values of the decision tree and k-NN models. The results showed that the decision tree models (avg. RMSE: 11.7) are more effective than the k-NN models (avg. RMSE: 14.4) for writing assignments in terms of average RMSE, while the k-NN models (avg. RMSE: 10.8) are slightly more effective than the decision tree models (avg. RMSE: 11.9) for programming assignments. Text metrics may be useful for classifying content, but they showed less effect on grade prediction. Future work includes the following. First, we will improve the prediction capability of the models by investigating other metrics that capture additional features of the data. Second, we will explore the semantics of the text, which may also help guide modelling with higher performance.

6. ACKNOWLEDGMENTS
This study is partially funded by the PeerLogic project under National Science Foundation grants 1432347, 1431856, 1432580, 1432690, and 1431975.

7. REFERENCES
[1] Topping, K. "Peer assessment between students in colleges and universities." Review of Educational Research 68.3 (1998): 249-276.
[2] Coursera: http://www.coursera.org/, 2016.
[3] Gehringer, E. "Expertiza: Information management for collaborative learning." Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-Learning Support, pp. 143-159, 2009.
[4] Kulkarni, C., Koh, P. W., Le, H., Chia, D., Papadopoulos, K., Cheng, J., Koller, D., and Klemmer, S. R. "Peer and self assessment in massive online classes." In Design Thinking Research, pp. 131-168. Springer International Publishing, 2015.
[5] Ramachandran, L. and Gehringer, E. "Automated assessment of review quality using latent semantic analysis." 11th IEEE International Conference on Advanced Learning Technologies, 2011.
[6] Ramachandran, L. and Gehringer, E. "An automated approach to assessing the quality of code reviews." American Society for Engineering Education, San Antonio, TX, 2012.
[7] Margerum, L., Gulsrud, M., and Manlapez, R. "Application of calibrated peer review (CPR) writing assignments to enhance experiments with an environmental chemistry focus." Journal of Chemical Education 84, no. 2 (2007): 292.
[8] de Alfaro, L. and Shavlovsky, M. "CrowdGrader: A tool for crowdsourcing the evaluation of homework assignments." Proc. 45th ACM Technical Symposium on Computer Science Education (SIGCSE '14), pp. 415-420, 2014.
[9] Jonsson, A. and Svingby, G. "The use of scoring rubrics: Reliability, validity and educational consequences." Educational Research Review, vol. 2, no. 2, pp. 130-144, 2007.
[10] Song, Y., Hu, Z., and Gehringer, E. F. "Closing the circle: Use of students' responses for peer-assessment rubric improvement." Proc. Advances in Web-Based Learning (ICWL 2015), pp. 27-36, 2015.
[11] Song, Y., Hu, Z., and Gehringer, E. F. "Pluggable reputation systems for peer review: A web-service approach." Proc. Frontiers in Education (FIE), pp. 1-5, 2015.
[12] JMP Decision Tree Model. https://www.jmp.com/support/downloads/pdf/jmp11/Specialized_Models.pdf
[13] Phyu, T. N. "Survey of classification techniques in data mining." In Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, pp. 18-20, 2009.
8. APPENDIX

Appendix A. Examples of Rubric Criteria of Writing Assignments in CSC 517

A1. Organization: how logical and clear is the organization?
    Score range: 0 (terrible organization) to 5 (very logical and clear)

A2. Clarity: Are the sentences clear and non-duplicative? Is the language used in this artifact simple and basic enough to be understood?
    Score range: 0 (terrible English usage) to 5 (good English usage)

A3. Did the authors revise their work in accordance with your suggestions?
    Score range: 0 (not agree) to 5 (strong agree)

A4. Originality: If you found any plagiarism in round 1, has it been removed? Then, randomly pick some sentences or paragraphs and search for them with a search engine. Describe any text that may infringe copyrights.
    Score range: 0 (several places of plagiarism spotted) to 5 (no plagiarism spotted)

A5. Coverage: does the artifact cover all the important aspects that readers need to know about this topic? Are all the aspects discussed at about the same level of detail?
    Score range: 0 (not agree) to 5 (strong agree)

A6. Definitions: are the definitions of unfamiliar terms clear and concise? Are the definitions adequately supported by explanations or examples?
    Score range: 0 (several definitions are missing or incomplete) to 5 (strong agree)

A7. References: do the major concepts have citations to more detailed treatments? Are there any unavailable links?
    Score range: 0 (many more references should be added) to 5 (strong agree)

A8. List the unfamiliar terms used in this wiki. Are those unfamiliar terms well defined or linked to proper references?
    Score range: 1 (neither defined nor linked) to 5 (well defined or links are added)

A9. Rate the overall readability of the article. Explain why you give this score.
    Score range: 1 (not readable and confusing) to 5 (readable and not confusing)

A10. Rate the English usage. Give a list of spelling, grammar, punctuation, or language usage mistakes you can find in this wiki (e.g., ruby on rails -> Ruby on Rails).
    Score range: 1 (terrible English usage) to 5 (good English usage)

A11. List any related terms or concepts for which the writer failed to give adequate citations and links. Rate the helpfulness of the citations.
    Score range: 1 (more citations are needed) to 5 (adequate citations)

A12. Rate how logical and clear the organization is. Point out any places where you think that the organization of this article needs to be improved.
    Score range: 1 (terrible organization) to 5 (very logical and clear)

Appendix B. Snapshot of Decision Tree Model for Writing Assignments for Base+Text+Rep Metrics