Automated Metareviewing: A Classifier Approach to Assess the Quality of Reviews

Ravi K. Yadav
Computer Science Department
North Carolina State University
Raleigh, NC, USA
rkyadav@ncsu.edu

Edward F. Gehringer
Computer Science Department
North Carolina State University
Raleigh, NC, USA
efg@ncsu.edu

ABSTRACT
A review's quality can be evaluated through a metric-based automated metareview. But not all metrics should be weighted the same when evaluating the overall quality of a review. For instance, if a review identifies specific problems in the reviewed artifact, then even with low scores on other metrics it should be evaluated as a helpful review. To evaluate the usefulness of a review, we propose decision-tree-based classifier models computed from the raw scores of the metareview metrics; instead of using all the metrics, we can use a subset of them.

Keywords
Automated metareview; decision-tree classifier; peer-review systems; education artifacts

1. Introduction
MOOC-based education platforms as well as face-to-face classrooms are increasingly adopting peer assessment. Peer reviewing increases students' participation and fosters collaborative learning. Students are encouraged to review their peers' work and provide formative feedback. High-quality feedback can help the reviewee improve his/her work.

Reviewing (or evaluating) a review is known as metareviewing. For best results, a review should be metareviewed before being presented to the reviewee. Usually this is a manual task [1, 2] for the teaching staff, which becomes more demanding when the metareview is needed quickly. Automated metareviewing [3] is a technique of using a smart tool to evaluate the quality of a review using certain textual properties of the submitted feedback. These properties include tone, volume, content type, relevance, coverage, and plagiarism. Content type is further divided into problem identification, advisory, or summative evaluation of the reviewed work. These properties are the metrics used by the automated metareviewer to evaluate the usefulness of a review. Though a good review may contain all of these properties, we found that a good review need not contain all of them.

2. Metrics to assess a review
As mentioned above, a metareview evaluates a review based on certain textual properties, otherwise known as metrics. Below are the metrics used by our metareview evaluator.

Review relevance: A relevant review should discuss the work reviewed and try to identify problems/issues in the author's work.

Review content: This metric is further divided into three sub-metrics: summative, problem detection, and advisory.

Summative: A summative review provides either positive feedback or a summary of the author's work.

Problem detection: A review can detect one or more specific problems in the reviewed artifact.

Advisory: A reviewer can provide specific advice to the author, which can be used by the author to improve the artifact.

Coverage: Coverage is a measure of a review's ability to cover the main points of the artifact.

Tone: Tone refers to the semantic orientation of a text. Tone is divided into three categories: positive, negative, and neutral. A single review can contain varying measures of positive, negative, and neutral tone.

Volume: Volume measures the quantity of textual feedback provided by the reviewer.

Plagiarism: This metric is based on the originality of a review. If a review is copied, then it is marked as plagiarized. A review is compared against the artifact, the rubric used, and internet search results based on the review text.
3. Experiments
Our automated metareview system is a Ruby on Rails-based web service [10]. All the statistical calculations are performed using packages available in R. The metareview web service generates quantitative scores, but to determine the overall quality of a review based on these scores, we need a statistical model. This model, once trained, can be used to classify a review as a good or a bad one. To train this model, we performed an experiment in the form of a survey. We selected a collection of student artifacts from Expertiza [4] and used the reviews they received from the other students in the class. These reviews were rated manually by survey participants, as explained below.

The questionnaire used by the survey participants to evaluate the reviews was based on the metareview metrics. Table 1 lists all the questions used in the questionnaire. Survey participants were asked to answer the questions by selecting a response on a scale of 1-5, where 1 is the lowest score and 5 the highest. In this experiment, we ignored the plagiarism metric, so no question was asked related to it. The question on "overall quality" was used to generate the class identifier for each review.

Table 1: Questionnaire for the survey (each question answered on a scale of 1-5)
Summative: How well does the review adequately reflect (summarize) the artifact?
Problem detection: How well is the problem identified by the reviewer about the artifact?
Advisory: How specific is the advice provided by the reviewer to the author to improve the artifact?
Relevance: How relevant is the review to the artifact?
Coverage: Does the review cover all the parts of the artifact?
Tone: What do you think about the tone used by the reviewer? (1: strongly negative, 2: negative, 3: neutral, 4: positive, 5: strongly positive)
Volume: How satisfied are you with the quantity of comments provided by the reviewer?
Overall quality: How would you rate the overall quality of the review?

Experiment participants
Participants were former and current TAs from different departments of Engineering, Science, and Business. We trained them by explaining the essence of each metareview metric used in automated metareviewing. Multiple participants were asked to rate the same reviews in order to generate a holistic model. We created an anonymous system to prevent the participants from knowing the identity of the authors of the artifacts and the reviews.

The artifacts selected for this experiment were taken from the articles created by students in the Spring 2016 offering of the CSC 517 course at NC State University. As a part of this course, students wrote Wikipedia articles, which were then given to other students in the class for reviewing. Each student was required to review two articles; they were given the option to review two more articles to receive extra points.

Each review used in the survey was evaluated using the automated metareviewer, which generated a metareview score for each review. The metareview web service evaluates each sentence of a review and tries to identify positive or negative words used in it from a collection of word lists. If the counts are the same, the sentence is marked as neutral. An aggregated score over all the sentences is calculated for the review, so if a review contains both positive and negative sentences, the overall score can include both a positive and a negative component. For our experiment, we scaled the overall tone score: if the overall positive score for a review was higher than the negative score, it was translated to 1 (overall positive review); if the overall negative score was higher than the positive score, it was translated to -1 (overall negative review); otherwise it was converted to 0 (overall neutral review).
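The tone scaling described above can be expressed as a short R sketch. This is a minimal illustration only, not the code of the metareview web service; the word lists, the sentence-splitting regex, and the function name review_tone are hypothetical stand-ins.

```r
# Minimal sketch of the tone scaling described above (not the actual
# metareview web-service code). Word lists are hypothetical stand-ins.
positive_words <- c("good", "clear", "thorough", "helpful")
negative_words <- c("missing", "unclear", "confusing", "weak")

review_tone <- function(review_text) {
  # Split the review into sentences, then each sentence into lower-case words.
  sentences <- unlist(strsplit(review_text, "(?<=[.!?])\\s+", perl = TRUE))
  sentence_tone <- sapply(sentences, function(s) {
    words <- tolower(unlist(strsplit(s, "[^[:alpha:]]+")))
    pos <- sum(words %in% positive_words)
    neg <- sum(words %in% negative_words)
    sign(pos - neg)   # +1 positive sentence, -1 negative, 0 when counts tie
  })
  # Aggregate the per-sentence labels, then collapse to the -1 / 0 / +1
  # review-level scale used in the experiment.
  sign(sum(sentence_tone))
}

review_tone("The summary is clear and helpful. However, the citations are missing.")
```

In this toy example the positive and negative sentences cancel, so the review is scored 0 (neutral), mirroring the tie-handling described above.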
Data model & Results sentence of a review and tries to identify positive or negative Preprocessing data words used in it from a collection of word list. If the count is A total of 119 reviews were surveyed in this experiment. Since same, then it is marked as neutral. An aggregated score of all the more than one survey participant reviewed the same artifact, sentences is calculated for the review. So if a review contains each review was assigned the average of the scores it received positive and negative sentences, then the overall score can have a from all the participants. score for positive metric as well as negative metric. But for our experiment, we scaled the overall tone score. If the overall All the questions were answered on a scale of 1–5, with 5 being positive score for a review was higher than the negative score, the “best” score. For the tone metric, we found that only two then it was translated to 1 (overall positive review). If overall surveys assigned a score of 1 (highly negative) to a review, negative score was higher than positive score, then it was whereas about 60% reviews received a score of 4 (positive). translated to –1 (overall negative review), else it was converted About 10% received a score of 5 (highly positive). We to 0 (overall neutral review). normalized the survey score for tone and grouped them into three categories. A score less than 3 (<3) was translated to –1 The survey participants had an absolute agreement (zero (Negative), whereas 3 was translated to 0 (Neutral) and a score tolerance) of 38.8% with inter-rater reliability, calculated using greater than 3 (>3) was converted to 1 (Positive). The survey weighted kappa [5], of 0.13. Inter-rater agreement increased to question associated with the overall quality of the review was 80% when the tolerance was relaxed by one point (±1). For normalized as well. A score higher than 3 was translated to good reviews surveyed by more than one person, an average score was review (1), otherwise it was marked as bad review (0). We used used to represent the final score. For some of the metrics in this metric as class identifier for our data modeling. This was Figure 1, such as coverage, summative, and problem done to create a holistic model. identification, the distribution is concentrated toward the center Figure 1 shows the distribution of surveys scores for each metric axis of graph. This explains the sudden increase of inter-rater agreement when the tolerance is relaxed by 1 point. Other individually. We can see from this figure that not all the metrics metrics such as volume, relevance, and advisory shows a fair are dispersed equally, which correlates with the idea that each distribution cross the rating scale. Sixty-five percent of reviews were rated as good whereas others review as per surveys experts. This translates to similar results, were marked as bad by the survey experts. Table 2 lists the which we derived from Table 2. Based on the experiment and the Pearson Correlation matrix between the score of the questions data collected from automated metareviewing, volume, based on metareview metrics to the overall quality of the review summative, and advisory are better suited metrics on which to as rated by survey participants. It can be easily inferred from create a model to categorize the quality of a review. Other Table 2, that each metric is highly correlated with the overall metrics like tone, and problem identification should be used in quality of the review, except tone. 
Sixty-five percent of the reviews were rated as good, whereas the others were marked as bad by the survey experts. Table 2 lists the Pearson correlation between the scores for the questions based on the metareview metrics and the overall quality of the review as rated by survey participants. It can easily be inferred from Table 2 that every metric except tone is highly correlated with the overall quality of the review. As per Figure 1, volume and advisory are the two most dispersed metrics, and they also show greater correlation with the overall grade of a review, which makes them the two most important metrics for data modeling.

Table 2: Pearson correlation between the survey response for each metric and the overall quality of a review (degrees of freedom for each metric: 117; confidence interval: 95%)

Survey metric            Pearson correlation   p       t       95% confidence interval
Summative                0.56                  0       7.36    0.43 to 0.67
Problem identification   0.57                  0       7.56    0.44 to 0.68
Advisory                 0.67                  0       9.79    0.56 to 0.76
Coverage                 0.68                  0       10.0    0.57 to 0.77
Relevance                0.67                  0       9.66    0.55 to 0.76
Tone                     0.20                  0.032   2.17    0.02 to 0.36
Volume                   0.75                  0       12.2    0.66 to 0.82

Table 3 shows the one-to-one correlation between the score each metric received from the survey questions and the corresponding metric score from the web service. As per this table, the web service and expert scores agree most on the volume metric. Other metrics, such as summative, advisory, and tone, show appreciable agreement as well. The correlation for the relevance metric is very weak, which suggests that a different strategy should be employed to improve the performance of the relevance metric generator.

Table 3: Pearson correlation between each metric's score from the survey and from the metareview system (degrees of freedom for each metric: 117; confidence interval: 95%)

Metric                   Pearson correlation   p       t       95% confidence interval
Summative                0.17                  0.06    1.90    -0.01 to 0.34
Problem identification   -0.03                 0.74    -0.34   -0.21 to 0.15
Advisory                 0.22                  0.02    2.42    0.04 to 0.38
Coverage                 0.02                  0.87    0.16    -0.17 to 0.19
Relevance                0.01                  0.94    0.08    -0.17 to 0.19
Tone                     0.25                  0.01    2.80    0.07 to 0.41
Volume                   0.58                  0       7.67    0.44 to 0.69

Table 4 shows the Pearson correlation between the scores from the automated metareview metrics and the overall quality of the review as rated by the survey experts. This translates to results similar to those we derived from Table 2. Based on the experiment and the data collected from automated metareviewing, volume, summative, and advisory are the metrics best suited for creating a model to categorize the quality of a review. Other metrics, like tone and problem identification, should be used in modeling as well, but metrics such as relevance and coverage do not perform well, so they cannot be used for data modeling.

Table 4: Pearson correlation between the metareview metric score and the overall quality of a review (degrees of freedom for each metric: 117; confidence interval: 95%)

Metareview metric        Pearson correlation   p       t       95% confidence interval
Summative                0.22                  0.02    2.46    0.04 to 0.39
Problem identification   0.13                  0.16    1.42    -0.05 to 0.30
Advisory                 0.25                  0.01    2.77    0.07 to 0.41
Coverage                 -0.02                 0.81    -0.24   -0.20 to 0.16
Relevance                -0.05                 0.61    -0.52   -0.23 to 0.13
Tone                     0.15                  0.11    1.60    -0.03 to 0.32
Volume                   0.55                  0       7.07    0.41 to 0.66
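The quantities reported in Tables 2-4 (Pearson's r, the t statistic, the p-value, and the 95% confidence interval, with 119 reviews giving 117 degrees of freedom) are what R's built-in cor.test() reports. A minimal sketch on synthetic placeholder data:

```r
# Minimal sketch of the correlation tests behind Tables 2-4. The data here
# are synthetic placeholders; in the study, 119 per-review averaged scores
# were used, giving df = 119 - 2 = 117.
set.seed(1)
n       <- 119
volume  <- rnorm(n, mean = 50, sd = 20)      # stand-in metric score
overall <- 0.03 * volume + rnorm(n)          # stand-in overall-quality score

cor.test(volume, overall, method = "pearson", conf.level = 0.95)
# Reports Pearson's r, the t statistic, df = n - 2, the p-value, and the
# 95% confidence interval, i.e., the quantities listed in Tables 2, 3, and 4.
```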
Decision-tree modeling and results
While selecting a model that can be used to differentiate between a good and a bad review, we investigated various modeling methodologies. We wanted a model that is inexpensive to construct, that can be retrained, and that is extremely fast at classifying unknown reviews. Also, since we are ignoring two metrics in this modeling, we wanted a model that is flexible enough to incorporate these variables at a later stage. One modeling technique that looks ideal for these requirements is a decision tree.

To create a decision tree, we started with Classification and Regression Trees (CART) modeling using the rpart [6] library in R. This library provides several ways to generate trees, such as classification and regression; the classification method is used in this experiment.

To find an optimal tree, a first attempt was made with the volume, summative, advisory, problem identification, and tone metrics. The summary function in the rpart library shows that volume is a very important metric when generating the classification tree. Table 5 shows the result of the summary function, which indicates that tone and problem identification were the least preferred metrics for classification.

Table 5: Comparative variable importance for tree generation based on the rpart library

Volume   Advisory   Summative   Tone   Problem identification
64%      15%        13%         4%     4%

From Tables 2, 3, and 4, volume shows a stronger correlation with the class identifier (overall quality). Figure 2 shows that the volume metric alone can construct a classification tree to identify review quality. This decision tree can be used to identify whether a review is good or bad on the basis of the volume score received from the automated metareview metric. For instance, if the volume score is greater than 68, it is a good review, and if the score is less than 26, it is a bad review. This tree is not pruned; another algorithm, discussed later, generates a more heavily pruned tree.

Figure 2: Unpruned classification tree based on metareview scores (using rpart).

Node 1 divides the sample space into two sets containing 42 and 77 observations, respectively; a score of 68.5 for the metareview metric volume is used as the first split criterion. Each node number is marked in Figure 2, along with its split criterion and class probabilities.

Though volume can be a good classifier, volume alone should not be used to identify the quality of a review. We found in another study [7] that review volume may be related to the rubric used in the review phase: some rubrics ask for more feedback from reviewers than others. The volume metric can therefore be misleading and can result in a higher number of false positives; a reviewer can provide gibberish comments that still earn a good metareview score for volume. We should consider other metrics as well to evaluate the overall quality of such a review. This calls for another decision tree based on the other metrics. We can then use both decision trees to classify a review: if either tree classifies a review as bad, that information can be shown to the reviewer as guidance, helping the reviewer to correct issues with the review.

Figure 3 shows the decision tree created without the volume metric. We saw earlier that advisory and summative were the next two strongest metrics after volume, and as per the decision-tree construction algorithm, these two metrics can create the decision tree as well. Since these metrics suppress the tone and problem-identification metrics, we could have created yet another decision tree based on tone and problem identification to further classify the review. But we chose to ignore them, since as per the rpart library's variable importance, their importance is very low compared to the other three metrics used to generate the trees in Figures 2 and 3.

Figure 3: Classification tree based on metareview scores, excluding volume (using rpart).

According to the tree in Figure 3, if a review receives a score in excess of 0.25 for advisory, then it is a good review; otherwise we check the score it receives for the summative metric. If a review receives a score less than 0.25 for advisory and a score in excess of 0.25 for summative, it is classified as a good review; else it is a bad review. As we can see, once the decision tree is created, the process of classifying a review becomes easy.
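The two CART trees discussed above, and the "use both trees" rule, can be sketched with rpart roughly as follows. This is an illustration on synthetic data, not the authors' exact script; on the real data the learned splits were 68.5 for volume and 0.25 for advisory and summative, whereas the synthetic data below will produce its own splits.

```r
# Illustrative rpart sketch (synthetic data; not the authors' exact script).
library(rpart)

set.seed(2)
n <- 119
reviews <- data.frame(
  volume    = runif(n, 0, 100),
  summative = runif(n),
  advisory  = runif(n),
  problem   = runif(n),
  tone      = sample(c(-1, 0, 1), n, replace = TRUE)
)
# Synthetic class label loosely mimicking the observed pattern
# (longer, more advisory reviews tend to be labelled good).
reviews$class <- factor(ifelse(reviews$volume / 100 + reviews$advisory +
                               rnorm(n, sd = 0.3) > 1, "good", "bad"))

# Tree 1: all five metrics; summary() reports the variable importance
# that Table 5 is based on (volume dominates).
tree_all <- rpart(class ~ volume + summative + advisory + problem + tone,
                  data = reviews, method = "class")
summary(tree_all)

# Tree 2: the same model without volume (the counterpart of Figure 3).
tree_novol <- rpart(class ~ summative + advisory + problem + tone,
                    data = reviews, method = "class")

# Hybrid rule suggested in the paper: a review counts as good only if
# both trees classify it as good.
p1 <- predict(tree_all,   reviews, type = "class")
p2 <- predict(tree_novol, reviews, type = "class")
hybrid_good <- (p1 == "good") & (p2 == "good")
table(hybrid_good)
```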
In order to validate the results obtained from the rpart library, another method of tree construction was explored. One such method is C5.0 [8], which is an extension of C4.5 [9]; C50 is the R package used to generate the tree based on the automated metareview scores. Ten-fold validation was used in decision-tree construction. Figure 4 shows the final tree, which includes all the metrics. As was noticed earlier in the tree constructed using the rpart classification method, the volume metric dominates the tree, and the root-node partition is based on volume > 68. This tree is shorter than the tree in Figure 2, because C5.0 uses pruning to create a shorter tree. Sometimes this pruning can result in an increased classification error rate. The classification error rate for this tree is 22.7%. The majority class probability for this classifier tree is 80.7%, which is higher than the baseline and the classification tree generated in Figure 2.

Figure 4: Classification tree based on metareview scores (using C5.0).

One more tree was constructed without the volume metric, as shown in Figure 5. The classification error rate for this tree is 29.4%, which is higher than that of the similar CART-based tree. The majority class probability using this tree comes to 87.4%, which is again higher than the baseline score and comparable to the CART tree without volume (Figure 3). This shows that the trees generated using rpart fit the data better than the similar trees generated using C5.0. C5.0 seems to generate a more heavily pruned tree, which is smaller in size, but with an increased classification error rate.

Figure 5: Decision tree based on metareview scores, without the volume metric (using C5.0).

Table 6 compares the majority-class-prediction performance of the different classification methods. A higher majority class probability compared to the baseline probability means more false positives. C5.0 generates shorter trees compared to CART, at the cost of reduced accuracy at times. We found that the CART-based classification trees are better at classification than the C5.0 trees.

Table 6: Comparison of majority class probability using different classification methods

Classification method                Majority class probability
Baseline (based on experiments)      64.7%
CART (metareview)                    72%
CART (metareview without volume)     88%
C5.0 (metareview)                    80.7%
C5.0 (metareview without volume)     87.4%
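For comparison, a C5.0 run can be sketched with the C50 package as below. This is a hedged illustration on synthetic data; the paper reports that 10-fold validation was used but not how it was implemented, so the manual fold loop here is an assumption rather than the authors' procedure.

```r
# Illustrative C5.0 sketch on synthetic data (not the authors' exact script).
# The manual fold loop is one assumed way to obtain a cross-validated error.
library(C50)

set.seed(3)
n <- 119
reviews <- data.frame(
  volume    = runif(n, 0, 100),
  summative = runif(n),
  advisory  = runif(n),
  problem   = runif(n),
  tone      = sample(c(-1, 0, 1), n, replace = TRUE)
)
reviews$class <- factor(ifelse(reviews$volume / 100 + reviews$advisory +
                               rnorm(n, sd = 0.3) > 1, "good", "bad"))

folds <- sample(rep(1:10, length.out = n))    # assign each review to a fold
errors <- sapply(1:10, function(k) {
  train <- reviews[folds != k, ]
  test  <- reviews[folds == k, ]
  model <- C5.0(class ~ ., data = train)      # pruned tree over all metrics
  mean(predict(model, test) != test$class)    # fold error rate
})
mean(errors)   # cross-validated classification error rate
```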
5. Discussion and conclusions
Metareviewing is an essential tool that can improve the quality of reviewing. A reviewer can write a better review if timely feedback is provided on the review before he/she submits it to the author.

As part of this work, we created a decision-tree classifier based on the scores a review receives from the automated metareviewer. Decision trees are fast and efficient classifiers for peer-review metrics. We found that certain metrics, such as volume, dominate the decision trees, but reliance on the volume metric alone can generate false positives. We therefore also created a decision tree excluding the volume metric; that tree uses the content advisory, content summative, tone, and problem detection metrics. We suggest the use of a hybrid model that uses both trees: each review is rated on both the tree from Figure 2 and the tree from Figure 3, and a good review should score well on both.

5.1 Future work
We used Wikipedia artifacts and the reviews written for them in this experiment. To make the model more robust, similar experiments can be done with artifacts from other educational domains. We used supervised learning to create this model. Natural language processing (NLP) is becoming more and more efficient at determining the semantics of a text; the relevance metric generator should be updated to make it more robust, so that it can also be used in the classification decision tree.

6. Acknowledgement
This work has been supported by the U.S. National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, and 1431975.

7. References
[1] K. Cho, "Machine classification of peer comments in physics," in Educational Data Mining, 2008, pp. 192-196.
[2] W. Xiong and D. Litman, "Empirical analysis of exploiting review helpfulness for extractive summarization of online reviews," in Proceedings of the 6th International Conference on Educational Data Mining (EDM), 2013.
[3] L. Ramachandran, "Automated Assessment of Reviews," PhD dissertation, North Carolina State University, Raleigh, 2013.
[4] E. F. Gehringer, "Expertiza: Managing feedback in collaborative learning," in Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-Learning Support, IGI Global Press, 2010, pp. 75-96.
[5] J. Cohen, "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit," Psychological Bulletin, 1968.
[6] T. Therneau, B. Atkinson, and B. Ripley, "rpart: Recursive Partitioning and Regression Trees," 2015. [Online]. Available: https://cran.r-project.org/web/packages/rpart/index.html.
[7] R. K. Yadav and E. F. Gehringer, "Metrics for Automated Review Classification: What Review Data Show," in State-of-the-Art and Future Directions of Smart Learning, Springer Singapore, 2016, pp. 333-340.
[8] M. Kuhn, S. Weston, N. Coulter, and M. Culp, "C5.0 Decision Trees and Rule-Based Models," CRAN, 2015. [Online]. Available: https://cran.r-project.org/web/packages/C50/C50.pdf.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[10] R. K. Yadav, "Web Services for Automated Assessment of Reviews," MS thesis, North Carolina State University, Raleigh, 2016.