Prediction of Grades for Reviewing with Automated Peer-review and Reputation Metrics

Da Young Lee, Ferry Pramudianto, Edward F. Gehringer
North Carolina State University
Raleigh, NC 27695
[dlee10, fferry, efg]@ncsu.edu


ABSTRACT
Peer review is an effective and useful method for improving students' learning through review by student peers, and it has been used in classes for several decades. To ensure its success, research challenges such as the quality of peer reviews must be addressed; in particular, it is challenging to identify how good a reviewer is. We develop a prediction model to assess students' reviewing capability. We investigate several important factors that influence reviewing capability, which corresponds to the instructor-assigned grade for reviewing. We use machine learning algorithms to build models for predicting grades for reviewing. Our models are based on several metrics, such as the spread of the scores a reviewer assigns to different rubric items and automated metrics that assess the textual feedback given by the reviewers. To improve the models, we also use reputation scores of students as reviewers. We present experimental results that show the effectiveness of the models.

Keywords
Peer reviews, rubrics, prediction model

1. INTRODUCTION
Peer review [1, 4] is an effective and useful method for improving students' learning by reviewing peer students' work. Peer review has been used in classes for several decades. In recent years, it has been used not only in traditional classes but also in online courses such as Massive Open Online Courses (MOOCs) [4]. For example, Coursera [2] offers many online courses in which thousands of students from around the world are enrolled. In such cases, instructors are not able to give feedback to such a large number of students in a timely manner. Therefore, the development of peer-review methods based on observing peer behaviors is important, and the technology should be made more reliable and useful to users.

The classroom peer-review process is as follows. Students submit their assignments, and reviewers (peer students) provide reviews of the assignments. The students then have a chance to improve their submitted work by incorporating the scores and comments in the reviews. Because reviewers in education are peer students, they may lack sufficient reviewing experience. Therefore, they need to be guided through the peer-review process to ensure that high-quality reviews are provided.

The assessment of reviews is a challenging problem in education. Meta-reviewing is a manual process [5] in which an instructor may assign grades and provide feedback as a measure of a student's reviewing capability. The problem is that this manual meta-reviewing process is tedious and time-consuming.

To address this issue, this study investigates methods to help identify good reviewers who write high-quality reviews. To attain this goal, we examine factors that may influence review scores and, using machine learning algorithms, propose a model that predicts how good reviewers are based on the reviews they have written.

We investigate several important factors that influence instructor-assigned grades, especially the reviewers' score-assigning behavior. In this paper, we refer to the instructor-assigned grade (i.e., the grade) as the student's reviewing-capability score assigned by the instructor. Another factor is automated peer-review metrics, which are text metrics [5, 6], such as tone, for assessing the textual feedback given by the reviewers. The other factor is a reputation metric [11] that determines who is a good reviewer based on historical review scores across artifacts. This reputation metric is calculated from a measure of the reviewer's leniency ("bias").

In this paper, we first investigate whether there is a strong or weak correlation between reviewers with high reviewing capability and the spread between their scores. Note that the spread between scores corresponds to the deviation of a reviewer's scores described in Section 3.3. We then investigate whether a model based on the reviewer assigning different scores would be effective for predicting how good the reviewer is. For this task, we apply machine learning techniques such as decision tree [12] and k-Nearest Neighbors [13] algorithms to build a prediction model. We then investigate whether this model, when it incorporates textual feedback, shows positive results for predicting how good the reviewer is. Lastly, we investigate whether our model combined with text metrics and reputation scores shows positive results for predicting how good the reviewer is. For these tasks, we investigate the following research questions:

   RQ1: Is there a correlation between a reviewer assigning different scores (i.e., the "spread" between scores) to different rubric items and instructor-assigned grades?

   RQ2: How well does our model, based on the reviewer assigning different scores, predict instructor-assigned grades?

   RQ3: How well does our model combined with text metrics of reviews predict instructor-assigned grades?

   RQ4: How well does our model combined with text metrics and reputation scores of reviewers predict instructor-assigned grades?

The rest of the paper is organized as follows. In Section 2, we briefly introduce the peer-review process and the peer-review system Expertiza [3]. In Section 3, we describe our methodology for the study. In Section 4, we present our experimental results. Finally, we give concluding remarks in Section 5.
2. BACKGROUND
This section discusses background for this study.

2.1 Peer Review System: Expertiza
There are many tools that support the peer-review process [3, 7, 8]. Expertiza is a web-based education system into which a feature for peer review is integrated. This feature is part of an active learning process for peer students.

Using Expertiza, students in a class are able to select tasks from an assignment list. After students complete their tasks, they submit their output to receive reviews from peers in the peer-review system. The submissions are reviewed by anonymous peers, who can provide helpful comments and give scores based on rubrics. Researchers have worked on peer-review systems for decades and have improved Expertiza as an effective learning management and peer-review system.

Students expect to receive author feedback. Typically, a double-blind review process makes it difficult for students to explain what they have done, especially when reviewers misunderstand the contents of a submission and give low grades. In Expertiza, peer review may have multiple rounds, in which reviewers give feedback for improvement and check whether their suggestions have been implemented in the next round. Each round has several deadlines, which are useful for organizing reviewing and resubmission.

In Expertiza, functionality for supporting wikis is integrated for collaboration among students. For submissions, students may use a wiki, which is very helpful for supporting collaborative work on writing assignments. These wikis provide several features for easy editing and for keeping track of past revisions.

2.2 Peer Review
Each student can select more than one submission to review within one assignment period. Each review consists of a review rubric that guides students in completing the review. Each rubric may include multiple questions, called criteria. Appendix A is an example of a rubric, which consists of 12 rubric criteria. For example, a question may ask for an assessment of the organization, originality, grammar, or clarity of the writing submission under review. The rubric also asks whether the definitions, examples, and links found in the submission are of acceptable quality.

In the peer-review process, reviewers often provide two kinds of feedback: quantitative (scores) and qualitative feedback. Reviewers give numeric scores for certain rubric criteria. In other words, after the reviewers read the rubric, they submit textual feedback and a numeric-scale score for each criterion.

For example, a rubric criterion can be, "on a scale of 1 (worst) to 5 (best), how easy is it to understand the code?" Moreover, reviewers are often required to provide formative textual feedback in which their comments incorporate identified issues, suggestions, and other remarks. Numeric scores are helpful, but textual feedback gives more concrete ideas about the submission.

3. METHODOLOGY
This section discusses the methodology for this study.

3.1 Data

3.1.1 Data Collection
We assemble peer-review data from Expertiza [3]. This tool is a web-based educational learning application that helps students review peers' work. We analyze 703 records submitted by students who were assigned to grade the assignments of their peers.

The data set is collected from two graduate-level courses: CSC 517 (Object-Oriented Design and Development) and CSC 506 (Architecture of Parallel Computers), both offered at North Carolina State University. For example, in CSC 517, programming assignments and writing assignments are used for peer reviews. These assignments are team-based, with more than two students collaborating. We use six review assignments, four of which are writing assignments and two of which are programming assignments.

In this study, instructors manually assess the submitted reviews and assign scores within one review period, in which each student may review multiple submissions. A final grade is given based on the students' submissions and the quality of their reviews of their peers' submissions.

3.1.2 Data Preparation
A data cleaning process is required before the analysis; it includes combining multiple database and Excel tables based on the user's id using SAS. During this process, we remove entries whose numeric scores are 0 or NULL, which indicate an empty review. Invalid numeric scores can occur when students dropped their courses and did not assign scores to the submissions of peer students. In addition, a rubric may require only textual feedback; such rubric items are not included in this study.
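The cleaning rule above can be expressed compactly. The following is a minimal sketch, assuming the merged review records are available as a single table with a hypothetical numeric column named score; the actual merging was done from database and Excel tables with SAS.

    import pandas as pd

    # Hypothetical file and column names, for illustration only.
    records = pd.read_csv("review_records.csv")

    # Drop entries whose numeric score is NULL or 0; both indicate an empty
    # review (e.g., a student who dropped the course and never scored it).
    cleaned = records[records["score"].notna() & (records["score"] != 0)]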
3.2 Research Questions
We investigate several important factors that influence instructor-assigned grades, especially the reviewers' score-assigning behavior. As explained in Section 1, we refer to the instructor-assigned grade (i.e., the grade) as the student's reviewing-capability score assigned by the instructor.

To study the usefulness of review quality assessment, we investigate the following research questions:

   RQ1: Is there a correlation between a reviewer assigning different scores (i.e., the "spread" between scores) to different rubric items and instructor-assigned grades?

   RQ2: How well does our model, based on the reviewer assigning different scores, predict instructor-assigned grades?

   RQ3: How well does our model combined with text metrics of reviews predict instructor-assigned grades?

   RQ4: How well does our model combined with text metrics and reputation scores of reviewers predict instructor-assigned grades?

We now describe the research questions in more detail. For RQ1, reviewers may assign grades for multiple submissions within the same review. This research question investigates whether there is a strong or weak correlation between reviewers with high reviewing capability and the spread between their scores. Note that the spread between scores is measured by the weighted standard deviation described in Section 3.3. RQ2 investigates whether a model based on the reviewer assigning different scores would be effective for
                                                                         predicting how good the reviewer is. RQ3 investigates whether this
model incorporating textual feedback shows positive results for predicting how good the reviewer is. RQ4 investigates whether our model combined with text metrics and reputation scores shows positive results for predicting how good the reviewer is. Note that we use a text analysis tool to automatically extract the text metrics [5, 6]. We measure text metrics of the given textual feedback, such as content type, tone, and volume.

3.3 Metrics
We utilize the following metrics to address the research questions.

 Pearson Correlation Coefficient: The Pearson correlation coefficient measures the simple linear correlation between two sets of data, i.e., the degree to which they are related. The correlation is measured as follows:

   $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

   We measure the correlation between a reviewer assigning different scores to different rubric items and that reviewer being given a high grade by the instructor. The correlation coefficient ranges between -1 and 1, where 1 implies a perfect linear relation between X and Y, -1 implies that Y decreases linearly as X increases, and 0 implies no linear relation.

 Weighted Standard Deviation: This metric is measured as $\hat{w}\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - M)^2}$, where the standard deviation measures the spread of the observed numbers (x1, x2, ..., xn) in a data set with mean value M, and $\hat{w}$ is the weight. We use this value to measure the degree of spread of the scores given by a reviewer; $\hat{w}$ is the number of reviews assigned to each reviewer within one assignment.

 Average Number of Words (Avg. # Words): Given more than one review comment, this metric is the average number of words per comment.

We measure the weighted standard deviation and the average number of words, which are used as inputs to the machine learning algorithms for predicting instructor-assigned grades.
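As an illustration of these two base metrics, the sketch below computes them for one reviewer. It treats the weight as the number of scores in the list and uses the sample (n-1) standard deviation, which reproduces the value of about 4.5 used in the worked example of Section 3.4; that convention is an assumption on our part, since the formula above is written with 1/n.

    import statistics

    def weighted_std(scores):
        """Spread of a reviewer's scores, weighted by the number of reviews."""
        return len(scores) * statistics.stdev(scores)  # sample standard deviation

    def avg_num_words(comments):
        """Average number of words over a reviewer's textual comments."""
        return sum(len(c.split()) for c in comments) / len(comments)

    # Scores and comments that Alice gave to Bob's submission (Table 1).
    scores = [5, 4, 2]
    comments = [
        "The organization is good and nicely gives intro, features and then examples of the framework",
        "They are clear and the language is easily understood",
        "No, I don't see any changes from previous version",
    ]
    print(round(weighted_std(scores), 2))  # 4.58, reported as 4.5 in Section 3.4
    print(avg_num_words(comments))         # (15 + 9 + 9) / 3 = 11.0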
 Root Mean Square Error (RMSE): The RMSE between predicted values and actual values is computed as the square root of the mean of the squared deviations.

 Score Difference (Score Diff): This metric is the gap between a predicted value and the corresponding actual value.

RMSE and Score Diff are used to measure the effectiveness of the models: the larger the RMSE and Score Diff, the less effective a prediction model is; the smaller they are, the more effective it is.
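For concreteness, the two evaluation measures can be written as follows; this is a minimal sketch in which predicted and actual are hypothetical lists of grades of equal length.

    import math

    def rmse(predicted, actual):
        """Square root of the mean squared deviation between predictions and actuals."""
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    def avg_abs_score_diff(predicted, actual):
        """Average absolute gap between predicted and actual grades."""
        return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)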
For determining the quality of the textual feedback, we use various text metrics, which can be collected automatically via a text analysis tool [5, 6].

 Content Type: reviews may include different types of content. This metric classifies content into one of three categories: summative, problem-detection, or advisory.
   - Summative content: positive feedback or a summary of the submission. For example, "The page is organized logically" is classified as summative content.
   - Problem-detection content: identifies problems in the submission. For example, "The page lacks a qualitative approach and an overview" is classified as problem-detection content.
   - Advisory content: provides suggestions to the students for improving their work. For example, "The page could contain more ethics related links" is classified as advisory content.

 Tone: reviews may carry different tones, which refer to the semantic orientation of the text, given the words and presentation written by the reviewer. This metric classifies content into one of three tones: positive, negative, or neutral.
   - Positive: a review is classified as having a positive tone when it contains positive feedback overall. For example, positive words or phrases such as "well-organized paper" and "complete" indicate a positive semantic orientation.
   - Negative: a review is classified as having a negative tone when it contains negative feedback overall. For example, negative words or phrases such as "copied", "poor", and "not complete" indicate a negative semantic orientation.
   - Neutral: a review is classified as having a neutral tone when it contains neutral feedback or a mix of positive and negative feedback. For example, "The organization looks good overall; however, we did not understand the terms." indicates a neutral semantic orientation: "looks good" is positive and "did not understand" is negative.

 Volume: reviews may include different numbers of words. This metric refers to the quantity of unique tokens in the review, excluding stop words such as pronouns.

We also use Lauw's reputation score, as identified by Song et al. [11]. The Lauw-peer algorithm is based on a measure of the reviewer's leniency ("bias"), which can be either positive or negative.

 Lauw's Reputation Score: this metric measures who is a good reviewer based on historical review scores across artifacts. The reputation range calculated by the Lauw algorithm is [0, 1]; a reputation score close to 1 means the reviewer is credible.

We measure the text metrics, which are used as additional inputs to the machine learning algorithms for predicting instructor-assigned grades.
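The content-type and tone labels in our data come from the text analysis tool of [5, 6]. Purely to illustrate the volume metric, with a naive keyword heuristic standing in for the tone classifier, one could write a sketch like the following; the stop-word and cue-word lists are toy assumptions, not the tool's actual lexicon.

    STOP_WORDS = {"the", "a", "an", "and", "or", "it", "they", "we", "i", "is", "are"}
    POSITIVE_CUES = {"well-organized", "complete", "good", "clear"}
    NEGATIVE_CUES = {"copied", "poor", "lacks", "not"}

    def tokens(text):
        return {t.strip('.,;:!?"').lower() for t in text.split()}

    def volume(review_text):
        """Number of unique tokens in the review, excluding stop words."""
        return len(tokens(review_text) - STOP_WORDS)

    def naive_tone(review_text):
        """Toy stand-in for the tone classifier: positive, negative, or neutral."""
        t = tokens(review_text)
        pos, neg = len(t & POSITIVE_CUES), len(t & NEGATIVE_CUES)
        if pos and not neg:
            return "positive"
        if neg and not pos:
            return "negative"
        return "neutral"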
3.4 Approach
Machine learning approaches [12, 13] such as k-nearest neighbor, decision tree, and neural network are useful for prediction. For our experiments, we use k-nearest neighbor and decision tree, which are based on supervised learning [13]. We first choose the k-nearest neighbor classifier because it is based on learning by analogy of the input values. With this model, we can observe closeness patterns based on the Euclidean distance between the training and validation data sets. Because that model is rooted in analogy, we also use another approach that is known to perform well when classifying categorical values: the decision tree model is a good fit for our input data set with its different kinds of values. In addition, this approach is comprehensible, since it produces a tree structure. In this paper, we did not consider other machine learning algorithms such as SVM and Naïve Bayes, because they require complicated calculations and it is not easy to track how values are classified.
Table 1. Examples of writing-assignment reviews for a submission, with the numeric score and textual feedback given by the reviewer. The detailed questions for the rubric criteria are found in Appendix A.

Rubric Criteria     Score   Textual Feedback
A1. Organization    5       The organization is good and nicely gives intro, features and then examples of the framework
A2. Clarity         4       They are clear and the language is easily understood
A3. Revision        2       No, I don't see any changes from previous version

Table 2. Input data sets used for our decision tree and k-Nearest Neighbors algorithms for predicting instructor-assigned grades for reviewing.

Name    Input Data Set
Base    Weighted Standard Deviation; Average Number of Words
Text    Content Type; Tone; Volume
Rep     Lauw's Reputation Score

We first propose a simple baseline model, which predicts the instructor-assigned grade as the average grade for reviewing in the training data set. This baseline model is used to judge whether machine learning algorithms are useful for grade prediction: models based on machine learning algorithms should show better prediction than the simple baseline.

As Expertiza has been used for several years, we have sufficient data concerning students' reviews and instructor grade assignment. We divide our data into a training set and a validation set. Our goal in using the machine learning approach is to predict instructor-assigned grades. This problem is related to the reputation of peers: the peers who carefully review other students' submissions are likely to receive higher scores. We first use the training data set to train our decision tree model and then apply the model to the validation set for score prediction.

In this paper, we use a decision tree [12] to explore and model our data. Decision trees are typically used in operations research, especially for decision analysis. As we would like to predict specific decisions related to scores, decision models are applicable. We used a SAS tool, called JMP [12], in which machine learning approaches are integrated. In this tool, the partition function recursively partitions data according to the relationship between data sets. Given the relationship between the X and Y values, a tree of partitions is generated. In this process, the tool automatically searches for groups and continues splitting off separate groups. This is done recursively until the tool reaches a specified desired fit; the tool stops when the prediction result no longer improves.

We also consider the k-Nearest Neighbors algorithm (k-NN for short). The k-NN algorithm is widely used among machine learning algorithms. k-NN is a non-parametric method used for classification and is supported by the SAS/JMP tool. The input consists of the k closest training examples, which contain similar features. k-NN is a type of learning that finds an approximated, locally similar classification.

We describe the steps of our work using the example reviews in Table 1. We compare the automated metrics with the reputation score assessed by a reputation algorithm, Lauw's model, for predicting the instructor-assigned grades for reviewing. We also combine the two approaches to see whether the combined models show better prediction.

Consider that Alice gives scores and comments to Bob. Note that rubric criteria A1, A2, and A3 are found in Appendix A.

Step 1. Collect the list of scores that students gave to assignments. Suppose that Alice gave scores and comments to Bob's assignment. As shown in Table 1, Alice rated Bob's assignment 5, 4, and 2. These scores are used for calculating the weighted standard deviation, which is 4.5, given that the standard deviation is 1.5 and the weight is 3.

Step 2. Collect the list of comments given by a reviewer. The numbers of words in the three pieces of textual feedback are 15, 9, and 9; the total count is 33 and the average number of words is 11.

Step 3. Suppose that Alice's reviewing score assigned by the instructor is 95. Using the weighted standard deviation and the average number of words, we use a decision tree to predict this instructor-assigned grade. From the collected data set, we use 2/3 of the data as a training set and 1/3 as a validation set. The decision tree is trained to predict the instructor-assigned grade.

Step 4. Using the weighted standard deviation and the average number of words, the k-NN algorithm is used to predict this instructor-assigned grade. We set k to 10.

Step 5. Calculate the reputation of a reviewer using Lauw's reputation model. This reputation score is then converted to a predicted instructor-assigned grade.

Step 6. Compare prediction performance. RMSE and Score Diff are used to measure the effectiveness of the approaches.
We then also extend these models using different sets of metrics for the decision tree and k-NN algorithms, as described in Table 2.
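The experiments themselves were run in SAS/JMP; the sketch below only mirrors the overall protocol with scikit-learn on synthetic stand-in data (a 2/3-1/3 split, k = 10, and a baseline that predicts the mean training grade), purely to make the pipeline concrete. The feature values and grades here are hypothetical.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    # Stand-in data: [weighted std dev, avg. number of words] per reviewer,
    # and the instructor-assigned grade for reviewing (0-100).
    rng = np.random.default_rng(0)
    X = rng.uniform([0, 0], [10, 60], size=(300, 2))
    y = np.clip(60 + 3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 8, 300), 0, 100)

    # 2/3 of the data for training, 1/3 for validation.
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=1/3, random_state=0)

    models = {
        "decision tree": DecisionTreeRegressor(random_state=0),
        "k-NN (k=10)": KNeighborsRegressor(n_neighbors=10),
    }
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_va)
        print(name, "RMSE:", mean_squared_error(y_va, pred) ** 0.5)

    # Baseline: predict the mean training grade for every validation reviewer.
    baseline = np.full_like(y_va, y_tr.mean())
    print("baseline RMSE:", mean_squared_error(y_va, baseline) ** 0.5)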
4. RESULTS
In this section, we investigate the factors that influence instructor-assigned grades. Figure 1 shows the instructor-assigned grade distribution. As shown in Figure 1, most grades fall above 75, with a few low scores.

Figure 1. Instructor-assigned grade distribution, where the X-axis represents scores and the Y-axis represents the number of review assignments.

4.1 Effect of Reviewer Assigning Different Scores
We describe hypothesis 1 to answer RQ1.

Hypothesis 1: There is a strong correlation between a reviewer assigning different scores to different rubric items and instructor-assigned grades. That is, a reviewer who carefully considers what score a student should receive for each rubric item (and therefore gives different scores for different rubric items) is likely to be assigned a higher grade by the instructor than a student who tends to assign the same score (e.g., 4 out of 5) to all or almost all rubric items.

The purpose of this study is to investigate the effect of the reviewer assigning different scores on grade prediction. This research question investigates whether reviewers with high-quality reviews show some correlation between the scores assigned to different rubric items and their grades.

The first step is to find and collect the review scores within the same assignment for each student. The second step is to calculate the weighted standard deviation from the list of scores. The third step is to calculate the relationship between this deviation and instructor-assigned grades. The assumption is that a student who gives scores differently is a more careful reviewer, and that this student may receive a higher grade for their review assignments.

For the Pearson correlation, a good fit is useful for predicting an anticipated future rate. We assess the statistical significance using statistical testing methods. In this context, we measure the p-value of the correlation models. The p-value represents the probability of satisfying the model and is considered an estimate of its goodness of fit. Typically, the test is statistically significant if the p-value < 0.05. We used SAS software for this analysis.

A Pearson product-moment correlation coefficient was computed to assess the relationship between the deviation metric and grades. There was a positive but weak linear correlation between the two variables, r = 0.1, p = 0.03, where r is the correlation coefficient and p is the p-value. As r is small, we observe only a positive but weak correlation between a reviewer assigning different scores to different rubric items and instructor-assigned grades. Note that this shows only linear correlation.

In addition, as shown in Section 4.2, we still use this metric, the reviewer assigning different scores to different rubric items, for building models, because decision tree models with this metric are effective for grade prediction.

With regard to the Pearson product-moment correlation, we conclude that the data do not support hypothesis 1.
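The analysis was run in SAS; the same check can be reproduced with SciPy, as in the minimal sketch below, where the per-reviewer values are hypothetical placeholders.

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical per-reviewer values: weighted standard deviation of the
    # scores each reviewer assigned, and the instructor-assigned grade.
    wstd   = np.array([4.5, 0.0, 2.1, 3.3, 1.0, 5.2, 0.5, 2.8])
    grades = np.array([95, 78, 88, 90, 80, 93, 75, 85])

    r, p = pearsonr(wstd, grades)
    print(f"r = {r:.2f}, p = {p:.3f}")  # the study reports r = 0.1, p = 0.03 on its data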
4.2 Prediction of Instructor-Assigned Grades for Reviewing
We describe hypothesis 2 to answer RQ2.

Hypothesis 2: Our decision tree and k-nearest neighbor models based on the reviewer assigning different scores are effective for predicting instructor-assigned grades. That is, the decision tree and k-nearest neighbor models have smaller RMSEs than Lauw's reputation model.

The purpose of this study is to investigate whether models based on a reviewer assigning different scores are effective for predicting instructor-assigned grades.

The first step is to apply the decision tree model to partition the data for the best performance. We divide the data into a training set and a validation set: 2/3 of the data is used as the training set for modelling, and the remaining 1/3 is used as the validation set for comparison. The second step is to calculate the average difference and the RMSE between the predicted grades and the actual instructor-assigned grades for the decision tree model. The third step is to compare the results with those obtained from the k-nearest neighbor model, the baseline, and the prediction model based on a reputation system, Lauw's algorithm [11].

When we use the decision tree and k-nearest neighbor models, we employ two inputs: the weighted standard deviation and the average number of words of the reviews given by a student within one assignment. The output is the predicted grade. To compare performance, we measure the average of the absolute score difference between each actual grade and the corresponding predicted grade. RMSE is also used to measure how close the predicted values are to the actual values. Note that, as grades vary from low to high, grade prediction cannot be achieved with high accuracy; instead, we measure the average difference between predicted grades and actual instructor-assigned grades.

All available valid peer-review records are used in this experiment. We report the average absolute score difference and the root mean square error (RMSE) in Tables 3 and 4, which present the results of the decision tree (DT) model, the k-nearest neighbor (k-NN) model, the baseline, and Lauw's Rep Model for different input data sets. For example, Base+Text means that the Base and Text input data sets of Table 2 are used.
Table 3. Experimental results for writing assignments based on our decision tree (DT) model, k-nearest neighbor (k-NN) model, baseline, and Lauw's Rep Model. The decision tree has a lower RMSE than the baseline and Lauw's reputation model, and the RMSE decreases each time the decision tree is refined. The k-nearest neighbor model has a lower RMSE than Lauw's reputation model, and its RMSE is similar each time it is refined.

Model  Inputs            Avg. Abs. Score Diff   RMSE
DT     Base               9.4                   13.0
DT     Base+Text          8.8                   12.6
DT     Base+Rep           8.7                   11.4
DT     Base+Text+Rep      8.0                   10.1
k-NN   Base               8.8                   12.6
k-NN   Base+Text          9.1                   13.6
k-NN   Base+Rep           8.6                   12.2
k-NN   Base+Text+Rep      8.9                   12.9
Baseline                  9.9                   13.2
Lauw's Rep Model         16.2                   20.8
Avg. RMSE: DT 11.7, k-NN 14.4, Baseline 13.2, Lauw's Rep Model 20.8.

Table 4. Experimental results for programming assignments based on the decision tree (DT) model, k-nearest neighbor (k-NN) model, baseline, and Lauw's Rep Model. The decision tree has a lower RMSE than the baseline and Lauw's reputation model, and its RMSE is similar each time it is refined, except for Base+Text+Rep. The k-nearest neighbor model has a lower RMSE than Lauw's reputation model, but its RMSE tends to increase each time it is refined.

Model  Inputs            Avg. Abs. Score Diff   RMSE
DT     Base               8.7                   11.9
DT     Base+Text          8.4                   11.0
DT     Base+Rep           8.6                   11.7
DT     Base+Text+Rep      9.9                   13.0
k-NN   Base               8.4                    9.0
k-NN   Base+Text          8.5                   10.8
k-NN   Base+Rep           8.0                   12.3
k-NN   Base+Text+Rep      9.0                   11.2
Baseline                  8.9                   13.1
Lauw's Rep Model         16.2                   20.8
Avg. RMSE: DT 11.9, k-NN 10.8, Baseline 13.1, Lauw's Rep Model 20.8.
We observe the case with only the Base input data set, since this research question concerns only the base metrics. For the Base inputs, we observe that the decision tree and k-nearest neighbor models have smaller RMSEs than the baseline and Lauw's reputation model for both writing and programming assignments. Therefore, the decision tree and k-nearest neighbor models are more effective for prediction in this case. The DT and k-NN models are data-driven models, which assess the input data and find the best fit correlating these inputs with an output. We observe that Lauw's Rep Model is dependent on the data set. For example, the range [0, 1] is useful for a reputation score; however, a student who receives a reputation score of 0 would then be expected to receive the lowest grade (e.g., 0), which cannot happen, because the instructor considers many aspects other than reputation.

We conclude that the data support hypothesis 2.

4.3 Prediction of Instructor-Assigned Grades for Reviewing using Text Metrics
We describe hypothesis 3 to answer RQ3.

Hypothesis 3: Our decision tree and k-nearest neighbor models based on additional text metrics are more effective for predicting instructor-assigned grades than the preceding models (based on the reviewer assigning different scores). That is, the decision tree and k-nearest neighbor models based on additional text metrics have smaller RMSEs.

We investigate whether our models with additional text metrics derived from textual feedback show more effective results for predicting instructor-assigned grades than the preceding models. In this study, we measure three text metrics from the textual feedback: content type, tone, and volume.

The purpose of this study is to investigate whether additional text metrics can be useful as predictive metrics for improving the decision tree prediction results with regard to instructor-assigned grades. The first step is to measure the text metrics from the reviews. We create our models to partition the data for the best performance. For this model, we use the weighted standard deviation, the average number of words, content type, tone, and volume. We divide the data into training and validation sets. The second step is to calculate the average difference between the predicted grades and the actual instructor-assigned grades for our models. The third step is to compare the results with those obtained from our models in Section 4.2.

Tables 3 and 4 show the results of our models with text metrics for grade prediction. All available valid peer-review records are used in this experiment. We report the average score difference and root mean square error (RMSE) in Tables 3 and 4. From these results, when we compare the RMSE between the Base and Base+Text cases, we see that for the decision tree model the additional text metrics help improve the prediction power for grades, whereas for the k-nearest neighbor model they do not.

The k-nearest neighbor model's results are based on analogy, which is not effective for prediction in this case. Volume is already accounted for in the number of words; therefore, volume may not show substantial improvement. The review content types and tones generated by our meta-review service are not highly analogous for similar grades: we observe that some students may have higher review grades with negative tones and summative content, while other students may have higher review grades with positive tones and problem-detection content. The k-nearest neighbor models cannot distinguish these cases. Additionally, k-nearest neighbor models that incorporate more, yet unrelated, variables may be less effective than those limited to selected, related variables.

Our results depend on which model is used. We conclude that the data analyzed by the decision tree models support hypothesis 3, whereas the data analyzed by the k-nearest neighbor models do not support hypothesis 3.

4.4 Prediction of Instructor-Assigned Grades for Reviewing using Text Metrics and Reputation Models
We describe hypothesis 4 to answer RQ4.

Hypothesis 4: Our decision tree and k-nearest neighbor models based on additional reputation scores improve the prediction of instructor-assigned grades. That is, the decision tree and k-nearest neighbor models based on additional reputation scores have smaller RMSEs.

We investigate whether our models with additional text metrics and reputation scores show positive results for predicting instructor-assigned grades. In this study, we measure the text metrics from the textual feedback (content type, tone, and volume) together with the reputation score.

The purpose of this study is to investigate whether additional reputation scores can be useful as predictive variables for improving the decision tree prediction results with regard to instructor-assigned grades. The first step is to calculate the reputation scores [11] from the reviews. We create our models to partition the data for the best performance.
The second step is to calculate the average difference between the predicted grades and the actual instructor-assigned grades for our models. The third step is to compare the results with those obtained from the model in the preceding section.

Tables 3 and 4 show the results of grade prediction. All available valid peer-review records are used in this experiment. We report the average absolute score difference and root mean square error (RMSE) in Tables 3 and 4. From these results, for writing assignments, the decision tree model with the Base+Text+Rep inputs is the most effective in terms of RMSE; we infer that the reputation score helps improve grade-prediction performance in this case. However, for programming assignments, the decision tree model with the Base+Text+Rep inputs is not the most effective in terms of RMSE. The reason may be that, for programming assignments, the focus of reviewing is to check the correctness of program behavior against the requirements, with shorter textual feedback than for writing assignments.

Our results depend on which assignments are used. We conclude that our data partially support hypothesis 4: the decision tree models for writing assignments support hypothesis 4, but the decision tree models for programming assignments do not.

5. CONCLUSIONS AND FUTURE WORK
Peer review is an effective and useful method for improving students' learning by reviewing peer students' work. The quality of peer reviews is important when guiding students. To improve the quality of peer reviews, instructors grade the reviews based on students' scores and feedback. However, this process is manual, and automated decisions would be helpful. Prediction of instructor-assigned grades is a complex and challenging problem in peer-review systems. We used machine learning algorithms to build models for predicting grades for reviewing. Experimental results showed that the decision tree model and the k-nearest neighbor (k-NN) model are more effective than Lauw's reputation model in terms of RMSE. We also compared the average RMSE values of the decision tree and k-NN models. The results showed that the decision tree models (avg. RMSE: 11.7) are more effective than the k-NN models (avg. RMSE: 14.4) for writing assignments in terms of average RMSE, while the k-NN models (avg. RMSE: 10.8) are slightly more effective than the decision tree models (avg. RMSE: 11.9) for programming assignments. Text metrics may be useful for classifying content, but they showed less effect on grade prediction. Future work includes the following. First, we will improve the prediction capability of the models by investigating other metrics that capture additional features of the data. Second, we will explore the semantics of the text, which may also help guide modelling with higher performance.

6. ACKNOWLEDGMENTS
This study is partially funded by the PeerLogic project under National Science Foundation grants 1432347, 1431856, 1432580, 1432690, and 1431975.

7. REFERENCES
[1] Topping, K. "Peer assessment between students in colleges and universities." Review of Educational Research 68.3 (1998): 249-276.
[2] Coursera: http://www.coursera.org/, 2016.
[3] Gehringer, E. "Expertiza: Information management for collaborative learning." Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-Learning Support, pp. 143-159, 2009.
[4] Kulkarni, C., Koh, P. W., Le, H., Chia, D., Papadopoulos, K., Cheng, J., Koller, D., and Klemmer, S. R. "Peer and self assessment in massive online classes." In Design Thinking Research, pp. 131-168. Springer International Publishing, 2015.
[5] Ramachandran, L. and Gehringer, E. "Automated assessment of review quality using latent semantic analysis." 11th IEEE International Conference on Advanced Learning Technologies, 2011.
[6] Ramachandran, L. and Gehringer, E. "An automated approach to assessing the quality of code reviews." American Society for Engineering Education, San Antonio, TX, 2012.
[7] Margerum, L., Gulsrud, M., and Manlapez, R. "Application of calibrated peer review (CPR) writing assignments to enhance experiments with an environmental chemistry focus." Journal of Chemical Education 84, no. 2 (2007): 292.
[8] de Alfaro, L. and Shavlovsky, M. "CrowdGrader: A tool for crowdsourcing the evaluation of homework assignments." Proc. 45th ACM Technical Symposium on Computer Science Education (SIGCSE '14), pp. 415-420, 2014.
[9] Jonsson, A. and Svingby, G. "The use of scoring rubrics: Reliability, validity and educational consequences." Educational Research Review, vol. 2, no. 2, pp. 130-144, 2007.
[10] Song, Y., Hu, Z., and Gehringer, E. F. "Closing the circle: Use of students' responses for peer-assessment rubric improvement." Proc. Advances in Web-Based Learning (ICWL 2015), pp. 27-36, 2015.
[11] Song, Y., Hu, Z., and Gehringer, E. F. "Pluggable reputation systems for peer review: A web-service approach." Proc. Frontiers in Education (FIE), pp. 1-5, 2015.
[12] JMP Decision Tree Model. https://www.jmp.com/support/downloads/pdf/jmp11/Specialized_Models.pdf
[13] Phyu, T. N. "Survey of classification techniques in data mining." In Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, pp. 18-20, 2009.
8. APPENDIX

Appendix A. Examples of Rubric Criteria of Writing Assignments in CSC 517

A1. Organization: how logical and clear is the organization?
    Score range: 0 (terrible organization) to 5 (very logical and clear)

A2. Clarity: Are the sentences clear and non-duplicative? Is the language used in this artifact simple and basic enough to be understood?
    Score range: 0 (terrible English usage) to 5 (good English usage)

A3. Did the authors revise their work in accordance with your suggestions?
    Score range: 0 (not agree) to 5 (strong agree)

A4. Originality: If you found any plagiarism in round 1, has it been removed? Then, randomly pick some sentences or paragraphs and search for them with a search engine. Describe any text that may infringe copyrights.
    Score range: 0 (several places of plagiarism spotted) to 5 (no plagiarism spotted)

A5. Coverage: does the artifact cover all the important aspects that readers need to know about this topic? Are all the aspects discussed at about the same level of detail?
    Score range: 0 (not agree) to 5 (strong agree)

A6. Definitions: are the definitions of unfamiliar terms clear and concise? Are the definitions adequately supported by explanations or examples?
    Score range: 0 (several definitions are missing or incomplete) to 5 (strong agree)

A7. References: do the major concepts have citations to more detailed treatments? Are there any unavailable links?
    Score range: 0 (many more references should be added) to 5 (strong agree)

A8. List the unfamiliar terms used in this wiki. Are those unfamiliar terms well defined or linked to proper references?
    Score range: 1 (neither defined nor linked) to 5 (well defined or links are added)

A9. Rate the overall readability of the article. Explain why you give this score.
    Score range: 1 (not readable and confusing) to 5 (readable and not confusing)

A10. Rate the English usage. Give a list of spelling, grammar, punctuation, or language usage mistakes you can find in this wiki (e.g., ruby on rails -> Ruby on Rails).
    Score range: 1 (terrible English usage) to 5 (good English usage)

A11. List any related terms or concepts for which the writer failed to give adequate citations and links. Rate the helpfulness of the citations.
    Score range: 1 (more citations are needed) to 5 (adequate citations)

A12. Rate how logical and clear the organization is. Point out any places where you think that the organization of this article needs to be improved.
    Score range: 1 (terrible organization) to 5 (very logical and clear)

Appendix B. Snapshot of Decision Tree Model for Writing Assignments for Base+Text+Rep Metrics