Automated Metareviewing: A Classifier Approach to Assess the Quality of Reviews

Ravi K. Yadav
Computer Science Department
North Carolina State University
Raleigh, NC, USA
rkyadav@ncsu.edu

Edward F. Gehringer
Computer Science Department
North Carolina State University
Raleigh, NC, USA
efg@ncsu.edu

ABSTRACT
A review's quality can be evaluated through a metric-based automated metareview. But not all metrics should be weighted the same when evaluating the overall quality of a review. For instance, if a review identifies specific problems in the reviewed artifact, then even with low scores on other metrics it should be evaluated as a helpful review. To evaluate the usefulness of a review, we propose decision-tree-based classifier models computed from the raw scores of the metareview metrics; instead of using all the metrics, we can use a subset of them.

Keywords
Automated metareview; decision-tree classifier; peer-review systems; education artifacts

1. Introduction
MOOC-based education platforms as well as face-to-face classrooms are increasingly adopting peer assessment. Peer reviewing increases students' participation and fosters collaborative learning. Students are encouraged to review their peers' work and provide formative feedback. High-quality feedback can help the reviewee improve his/her work.

Reviewing (or evaluating) a review is known as metareviewing. For best results, a review should be metareviewed before being presented to the reviewee. Usually this is a manual task [1, 2] for the teaching staff, which becomes more demanding when the metareview is needed quickly. Automated metareviewing [3] is a technique of using a smart tool to evaluate the quality of a review using certain textual properties of the submitted feedback. These properties include tone, volume, content type, relevance, coverage, and plagiarism. Content type is further divided into problem identification, advisory, or summative evaluation of the reviewed work. These properties are the metrics used by the automated metareviewer to evaluate the usefulness of a review. Though a good review may contain all of these properties, we found that a good review need not contain all of them.

2. Metrics to assess a review
As mentioned above, a metareview evaluates a review based on certain textual properties, otherwise known as metrics. Below are the metrics used by our metareview evaluator.

Review relevance: A relevant review should discuss the work reviewed and try to identify problems/issues in the author's work.

Review content: This metric is further divided into three sub-metrics: summative, problem detection, and advisory.

Summative: A summative review provides either positive feedback or a summary of the author's work.

Problem detection: A review can detect one or more specific problems in the reviewed artifact.

Advisory: A reviewer can provide specific advice to the author, which can be used by the author to improve the artifact.

Coverage: Coverage is a measure of a review's ability to cover the main points of the artifact.

Tone: Tone refers to the semantic orientation of a text. Tone is divided into three categories: positive, negative, and neutral. A single review can contain varying measures of positive, negative, and neutral tone.

Volume: Volume measures the quantity of textual feedback provided by the reviewer.

Plagiarism: This metric is based on the originality of a review. If a review is copied, then it is marked as plagiarized. A review is compared against the artifact, the rubric used, and internet search results based on the review text.
3. Experiments
Our automated metareview system is a Ruby on Rails-based web service [10]. All the statistical calculations are performed using packages available in R. The metareview web service generates quantitative scores, but to determine the overall quality of a review based on these scores, we need a statistical model. This model, once trained, can be used to classify a review as a good or a bad one. To train this model, we performed an experiment in the form of a survey. We selected a collection of student artifacts from Expertiza [4] and used the reviews they received from the other students in the class. These reviews were rated manually by survey participants, as explained below.

The questionnaire used by the survey participants to evaluate the reviews was based on the metareview metrics. Table 1 lists all the questions used in the questionnaire. Survey participants were asked to answer the questions by selecting a response on a scale of 1-5, where 1 is the lowest score and 5 the highest. In this experiment, we ignored the plagiarism metric, so no question was asked related to it. The question on "overall quality" was used to generate the class identifier for each review.

Table 1: Questionnaire for the survey (each question answered on a scale of 1-5)
Summative: How well does the review adequately reflect (summarize) the artifact?
Problem detection: How well is the problem identified by the reviewer about the artifact?
Advisory: How specific is the advice provided by the reviewer to the author to improve the artifact?
Relevance: How relevant is the review to the artifact?
Coverage: Does the review cover all the parts of the artifact?
Tone: What do you think about the tone used by the reviewer? (1: strongly negative, 2: negative, 3: neutral, 4: positive, 5: strongly positive)
Volume: How satisfied are you with the quantity of comments provided by the reviewer?
Overall quality: How would you rate the overall quality of the review?

Experiment participants
Participants were former and current TAs from different departments of Engineering, Science, and Business. We trained them by explaining the essence of each metareview metric used in automated metareviewing. Multiple participants were asked to rate the same reviews in order to generate a holistic model. We created an anonymous system to prevent the participants from knowing the identity of the authors of the artifacts and the reviews.

The artifacts selected for this experiment were taken from the articles created by students in the Spring 2016 offering of the CSC 517 course at NC State University. As a part of this course, students wrote Wikipedia articles, which were then given to other students in the class for reviewing. Each student was required to review two articles; they were given the option to review two more articles to receive extra points.

Each review used in the survey was evaluated using the automated metareviewer, which generated a metareview score for each review. The metareview web service evaluates each sentence of a review and tries to identify positive or negative words used in it from a collection of word lists. If the counts are the same, the sentence is marked as neutral. An aggregated score over all the sentences is calculated for the review, so if a review contains both positive and negative sentences, the overall score can include both a positive and a negative component. For our experiment, we scaled the overall tone score: if the overall positive score for a review was higher than the negative score, it was translated to 1 (overall positive review); if the overall negative score was higher than the positive score, it was translated to -1 (overall negative review); otherwise it was converted to 0 (overall neutral review).
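The tone scaling described above can be expressed as a short R sketch. This is a minimal illustration only, not the code of the metareview web service; the word lists, the sentence-splitting regex, and the function name review_tone are hypothetical stand-ins.

```r
# Minimal sketch of the tone scaling described above (not the actual
# metareview web-service code). Word lists are hypothetical stand-ins.
positive_words <- c("good", "clear", "thorough", "helpful")
negative_words <- c("missing", "unclear", "confusing", "weak")

review_tone <- function(review_text) {
  # Split the review into sentences, then each sentence into lower-case words.
  sentences <- unlist(strsplit(review_text, "(?<=[.!?])\\s+", perl = TRUE))
  sentence_tone <- sapply(sentences, function(s) {
    words <- tolower(unlist(strsplit(s, "[^[:alpha:]]+")))
    pos <- sum(words %in% positive_words)
    neg <- sum(words %in% negative_words)
    sign(pos - neg)   # +1 positive sentence, -1 negative, 0 when counts tie
  })
  # Aggregate the per-sentence labels, then collapse to the -1 / 0 / +1
  # review-level scale used in the experiment.
  sign(sum(sentence_tone))
}

review_tone("The summary is clear and helpful. However, the citations are missing.")
```

In this toy example the positive and negative sentences cancel, so the review is scored 0 (neutral), mirroring the tie-handling described above.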
Data model & Results sentence of a review and tries to identify positive or negative Preprocessing data words used in it from a collection of word list. If the count is A total of 119 reviews were surveyed in this experiment. Since same, then it is marked as neutral. An aggregated score of all the more than one survey participant reviewed the same artifact, sentences is calculated for the review. So if a review contains each review was assigned the average of the scores it received positive and negative sentences, then the overall score can have a from all the participants. score for positive metric as well as negative metric. But for our experiment, we scaled the overall tone score. If the overall All the questions were answered on a scale of 1–5, with 5 being positive score for a review was higher than the negative score, the “best” score. For the tone metric, we found that only two then it was translated to 1 (overall positive review). If overall surveys assigned a score of 1 (highly negative) to a review, negative score was higher than positive score, then it was whereas about 60% reviews received a score of 4 (positive). translated to –1 (overall negative review), else it was converted About 10% received a score of 5 (highly positive). We to 0 (overall neutral review). normalized the survey score for tone and grouped them into three categories. A score less than 3 (<3) was translated to –1 The survey participants had an absolute agreement (zero (Negative), whereas 3 was translated to 0 (Neutral) and a score tolerance) of 38.8% with inter-rater reliability, calculated using greater than 3 (>3) was converted to 1 (Positive). The survey weighted kappa [5], of 0.13. Inter-rater agreement increased to question associated with the overall quality of the review was 80% when the tolerance was relaxed by one point (±1). For normalized as well. A score higher than 3 was translated to good reviews surveyed by more than one person, an average score was review (1), otherwise it was marked as bad review (0). We used used to represent the final score. For some of the metrics in this metric as class identifier for our data modeling. This was Figure 1, such as coverage, summative, and problem done to create a holistic model. identification, the distribution is concentrated toward the center Figure 1 shows the distribution of surveys scores for each metric axis of graph. This explains the sudden increase of inter-rater agreement when the tolerance is relaxed by 1 point. Other individually. We can see from this figure that not all the metrics metrics such as volume, relevance, and advisory shows a fair are dispersed equally, which correlates with the idea that each distribution cross the rating scale. Sixty-five percent of reviews were rated as good whereas others review as per surveys experts. This translates to similar results, were marked as bad by the survey experts. Table 2 lists the which we derived from Table 2. Based on the experiment and the Pearson Correlation matrix between the score of the questions data collected from automated metareviewing, volume, based on metareview metrics to the overall quality of the review summative, and advisory are better suited metrics on which to as rated by survey participants. It can be easily inferred from create a model to categorize the quality of a review. Other Table 2, that each metric is highly correlated with the overall metrics like tone, and problem identification should be used in quality of the review, except tone. 
Sixty-five percent of the reviews were rated as good, whereas the others were marked as bad by the survey experts. Table 2 lists the Pearson correlation between the scores for the questions based on the metareview metrics and the overall quality of the review as rated by survey participants. It can easily be inferred from Table 2 that every metric except tone is highly correlated with the overall quality of the review. As per Figure 1, volume and advisory are the two most dispersed metrics, and they also show greater correlation with the overall grade of a review, which makes them the two most important metrics for data modeling.

Table 2: Pearson correlation between the survey response for each metric and the overall quality of a review (degrees of freedom for each metric: 117; confidence interval: 95%)

Survey metric            Pearson correlation   p       t       95% confidence interval
Summative                0.56                  0       7.36    0.43 to 0.67
Problem identification   0.57                  0       7.56    0.44 to 0.68
Advisory                 0.67                  0       9.79    0.56 to 0.76
Coverage                 0.68                  0       10.0    0.57 to 0.77
Relevance                0.67                  0       9.66    0.55 to 0.76
Tone                     0.20                  0.032   2.17    0.02 to 0.36
Volume                   0.75                  0       12.2    0.66 to 0.82

Table 3 shows the one-to-one correlation between the score each metric received from the survey questions and the corresponding metric score from the web service. As per this table, the web service and expert scores agree most on the volume metric. Other metrics, such as summative, advisory, and tone, show appreciable agreement as well. The correlation for the relevance metric is very weak, which suggests that a different strategy should be employed to improve the performance of the relevance metric generator.

Table 3: Pearson correlation between each metric's score from the survey and from the metareview system (degrees of freedom for each metric: 117; confidence interval: 95%)

Metric                   Pearson correlation   p       t       95% confidence interval
Summative                0.17                  0.06    1.90    -0.01 to 0.34
Problem identification   -0.03                 0.74    -0.34   -0.21 to 0.15
Advisory                 0.22                  0.02    2.42    0.04 to 0.38
Coverage                 0.02                  0.87    0.16    -0.17 to 0.19
Relevance                0.01                  0.94    0.08    -0.17 to 0.19
Tone                     0.25                  0.01    2.80    0.07 to 0.41
Volume                   0.58                  0       7.67    0.44 to 0.69

Table 4 shows the Pearson correlation between the scores from the automated metareview metrics and the overall quality of the review as rated by the survey experts. This translates to results similar to those we derived from Table 2. Based on the experiment and the data collected from automated metareviewing, volume, summative, and advisory are the metrics best suited for creating a model to categorize the quality of a review. Other metrics, like tone and problem identification, should be used in modeling as well, but metrics such as relevance and coverage do not perform well, so they cannot be used for data modeling.

Table 4: Pearson correlation between the metareview metric score and the overall quality of a review (degrees of freedom for each metric: 117; confidence interval: 95%)

Metareview metric        Pearson correlation   p       t       95% confidence interval
Summative                0.22                  0.02    2.46    0.04 to 0.39
Problem identification   0.13                  0.16    1.42    -0.05 to 0.30
Advisory                 0.25                  0.01    2.77    0.07 to 0.41
Coverage                 -0.02                 0.81    -0.24   -0.20 to 0.16
Relevance                -0.05                 0.61    -0.52   -0.23 to 0.13
Tone                     0.15                  0.11    1.60    -0.03 to 0.32
Volume                   0.55                  0       7.07    0.41 to 0.66
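The quantities reported in Tables 2-4 (Pearson's r, the t statistic, the p-value, and the 95% confidence interval, with 119 reviews giving 117 degrees of freedom) are what R's built-in cor.test() reports. A minimal sketch on synthetic placeholder data:

```r
# Minimal sketch of the correlation tests behind Tables 2-4. The data here
# are synthetic placeholders; in the study, 119 per-review averaged scores
# were used, giving df = 119 - 2 = 117.
set.seed(1)
n       <- 119
volume  <- rnorm(n, mean = 50, sd = 20)      # stand-in metric score
overall <- 0.03 * volume + rnorm(n)          # stand-in overall-quality score

cor.test(volume, overall, method = "pearson", conf.level = 0.95)
# Reports Pearson's r, the t statistic, df = n - 2, the p-value, and the
# 95% confidence interval, i.e., the quantities listed in Tables 2, 3, and 4.
```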
Decision-tree modeling and results
While selecting a model that can be used to differentiate between a good and a bad review, we investigated various modeling methodologies. We wanted a model that is inexpensive to construct, that can be retrained, and that is extremely fast at classifying unknown reviews. Also, since we are ignoring two metrics in this modeling, we wanted a model that is flexible enough to incorporate these variables at a later stage. One modeling technique that looks ideal for these requirements is a decision tree.

To create a decision tree, we started with Classification and Regression Trees (CART) modeling using the rpart [6] library in R. This library provides several ways to generate trees, such as classification and regression; the classification method is used in this experiment.

To find an optimal tree, a first attempt was made with the volume, summative, advisory, problem identification, and tone metrics. The summary function in the rpart library shows that volume is a very important metric when generating the classification tree. Table 5 shows the result of the summary function, which indicates that tone and problem identification were the least preferred metrics for classification.

Table 5: Comparative variable importance for tree generation based on the rpart library

Volume   Advisory   Summative   Tone   Problem identification
64%      15%        13%         4%     4%

From Tables 2, 3, and 4, volume shows a stronger correlation with the class identifier (overall quality). Figure 2 shows that the volume metric alone can construct a classification tree to identify review quality. This decision tree can be used to identify whether a review is good or bad on the basis of the volume score received from the automated metareview metric. For instance, if the volume score is greater than 68, it is a good review, and if the score is less than 26, it is a bad review. This tree is not pruned; another algorithm, discussed later, generates a more heavily pruned tree.

Figure 2: Unpruned classification tree based on metareview scores (using rpart).

Node 1 divides the sample space into two sets containing 42 and 77 observations, respectively; a score of 68.5 for the metareview metric volume is used as the first split criterion. Each node number is marked in Figure 2, along with its split criterion and class probabilities.

Though volume can be a good classifier, volume alone should not be used to identify the quality of a review. We found in another study [7] that review volume may be related to the rubric used in the review phase: some rubrics ask for more feedback from reviewers than others. The volume metric can therefore be misleading and can result in a higher number of false positives; a reviewer can provide gibberish comments that still earn a good metareview score for volume. We should consider other metrics as well to evaluate the overall quality of such a review. This calls for another decision tree based on the other metrics. We can then use both decision trees to classify a review: if either tree classifies a review as bad, that information can be shown to the reviewer as guidance, helping the reviewer to correct issues with the review.

Figure 3 shows the decision tree created without the volume metric. We saw earlier that advisory and summative were the next two strongest metrics after volume, and as per the decision-tree construction algorithm, these two metrics can create the decision tree as well. Since these metrics suppress the tone and problem-identification metrics, we could have created yet another decision tree based on tone and problem identification to further classify the review. But we chose to ignore them, since as per the rpart library's variable importance, their importance is very low compared to the other three metrics used to generate the trees in Figures 2 and 3.

Figure 3: Classification tree based on metareview scores, excluding volume (using rpart).

According to the tree in Figure 3, if a review receives a score in excess of 0.25 for advisory, then it is a good review; otherwise we check the score it receives for the summative metric. If a review receives a score less than 0.25 for advisory and a score in excess of 0.25 for summative, it is classified as a good review; else it is a bad review. As we can see, once the decision tree is created, the process of classifying a review becomes easy.
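The two CART trees discussed above, and the "use both trees" rule, can be sketched with rpart roughly as follows. This is an illustration on synthetic data, not the authors' exact script; on the real data the learned splits were 68.5 for volume and 0.25 for advisory and summative, whereas the synthetic data below will produce its own splits.

```r
# Illustrative rpart sketch (synthetic data; not the authors' exact script).
library(rpart)

set.seed(2)
n <- 119
reviews <- data.frame(
  volume    = runif(n, 0, 100),
  summative = runif(n),
  advisory  = runif(n),
  problem   = runif(n),
  tone      = sample(c(-1, 0, 1), n, replace = TRUE)
)
# Synthetic class label loosely mimicking the observed pattern
# (longer, more advisory reviews tend to be labelled good).
reviews$class <- factor(ifelse(reviews$volume / 100 + reviews$advisory +
                               rnorm(n, sd = 0.3) > 1, "good", "bad"))

# Tree 1: all five metrics; summary() reports the variable importance
# that Table 5 is based on (volume dominates).
tree_all <- rpart(class ~ volume + summative + advisory + problem + tone,
                  data = reviews, method = "class")
summary(tree_all)

# Tree 2: the same model without volume (the counterpart of Figure 3).
tree_novol <- rpart(class ~ summative + advisory + problem + tone,
                    data = reviews, method = "class")

# Hybrid rule suggested in the paper: a review counts as good only if
# both trees classify it as good.
p1 <- predict(tree_all,   reviews, type = "class")
p2 <- predict(tree_novol, reviews, type = "class")
hybrid_good <- (p1 == "good") & (p2 == "good")
table(hybrid_good)
```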
In order to validate the results obtained from the rpart library, another method of tree construction was explored. One such method is C5.0 [8], which is an extension of C4.5 [9]; C50 is the R package used to generate the tree based on the automated metareview scores. Ten-fold validation was used in decision-tree construction. Figure 4 shows the final tree, which includes all the metrics. As was noticed earlier in the tree constructed using the rpart classification method, the volume metric dominates the tree, and the root-node partition is based on volume > 68. This tree is shorter than the tree in Figure 2, because C5.0 uses pruning to create a shorter tree. Sometimes this pruning can result in an increased classification error rate. The classification error rate for this tree is 22.7%. The majority class probability for this classifier tree is 80.7%, which is higher than the baseline and the classification tree generated in Figure 2.

Figure 4: Classification tree based on metareview scores (using C5.0).

One more tree was constructed without the volume metric, as shown in Figure 5. The classification error rate for this tree is 29.4%, which is higher than that of the similar CART-based tree. The majority class probability using this tree comes to 87.4%, which is again higher than the baseline score and comparable to the CART tree without volume (Figure 3). This shows that the trees generated using rpart fit the data better than the similar trees generated using C5.0. C5.0 seems to generate a more heavily pruned tree, which is smaller in size, but with an increased classification error rate.

Figure 5: Decision tree based on metareview scores, without the volume metric (using C5.0).

Table 6 compares the majority-class-prediction performance of the different classification methods. A higher majority class probability compared to the baseline probability means more false positives. C5.0 generates shorter trees compared to CART, at the cost of reduced accuracy at times. We found that the CART-based classification trees are better at classification than the C5.0 trees.

Table 6: Comparison of majority class probability using different classification methods

Classification method                Majority class probability
Baseline (based on experiments)      64.7%
CART (metareview)                    72%
CART (metareview without volume)     88%
C5.0 (metareview)                    80.7%
C5.0 (metareview without volume)     87.4%
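For comparison, a C5.0 run can be sketched with the C50 package as below. This is a hedged illustration on synthetic data; the paper reports that 10-fold validation was used but not how it was implemented, so the manual fold loop here is an assumption rather than the authors' procedure.

```r
# Illustrative C5.0 sketch on synthetic data (not the authors' exact script).
# The manual fold loop is one assumed way to obtain a cross-validated error.
library(C50)

set.seed(3)
n <- 119
reviews <- data.frame(
  volume    = runif(n, 0, 100),
  summative = runif(n),
  advisory  = runif(n),
  problem   = runif(n),
  tone      = sample(c(-1, 0, 1), n, replace = TRUE)
)
reviews$class <- factor(ifelse(reviews$volume / 100 + reviews$advisory +
                               rnorm(n, sd = 0.3) > 1, "good", "bad"))

folds <- sample(rep(1:10, length.out = n))    # assign each review to a fold
errors <- sapply(1:10, function(k) {
  train <- reviews[folds != k, ]
  test  <- reviews[folds == k, ]
  model <- C5.0(class ~ ., data = train)      # pruned tree over all metrics
  mean(predict(model, test) != test$class)    # fold error rate
})
mean(errors)   # cross-validated classification error rate
```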
5. Discussion and conclusions
Metareviewing is an essential tool that can improve the quality of reviewing. A reviewer can write a better review if timely feedback is provided on the review before he/she submits it to the author.

As part of this work, we created a decision-tree classifier based on the scores a review receives from the automated metareviewer. Decision trees are fast and efficient classifiers for peer-review metrics. We found that certain metrics, such as volume, dominate the decision trees, but reliance on the volume metric alone can generate false positives. We therefore also created a decision tree excluding the volume metric; that tree uses the content advisory, content summative, tone, and problem detection metrics. We suggest the use of a hybrid model that uses both trees: each review is rated on both the tree from Figure 2 and the tree from Figure 3, and a good review should score well on both.

5.1 Future work
We used Wikipedia artifacts and the reviews written for them in this experiment. To make the model more robust, similar experiments can be done with artifacts from other educational domains. We used supervised learning to create this model. Natural language processing (NLP) is becoming more and more efficient at determining the semantics of a text; the relevance metric generator should be updated to make it more robust, so that it can also be used in the classification decision tree.

6. Acknowledgement
This work has been supported by the U.S. National Science Foundation under grants 1432347, 1431856, 1432580, 1432690, and 1431975.

7. References
[1] K. Cho, "Machine classification of peer comments in physics," in Educational Data Mining, 2008, pp. 192-196.
[2] W. Xiong and D. Litman, "Empirical analysis of exploiting review helpfulness for extractive summarization of online reviews," in Proceedings of the 6th International Conference on Educational Data Mining (EDM), 2013.
[3] L. Ramachandran, "Automated Assessment of Reviews," PhD dissertation, North Carolina State University, Raleigh, 2013.
[4] E. F. Gehringer, "Expertiza: Managing feedback in collaborative learning," in Monitoring and Assessment in Online Collaborative Environments: Emergent Computational Technologies for E-Learning Support, IGI Global Press, 2010, pp. 75-96.
[5] J. Cohen, "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit," Psychological Bulletin, 1968.
[6] T. Therneau, B. Atkinson, and B. Ripley, "rpart: Recursive Partitioning and Regression Trees," 2015. [Online]. Available: https://cran.r-project.org/web/packages/rpart/index.html.
[7] R. K. Yadav and E. F. Gehringer, "Metrics for Automated Review Classification: What Review Data Show," in State-of-the-Art and Future Directions of Smart Learning, Springer Singapore, 2016, pp. 333-340.
[8] M. Kuhn, S. Weston, N. Coulter, and M. Culp, "C5.0 Decision Trees and Rule-Based Models," CRAN, 2015. [Online]. Available: https://cran.r-project.org/web/packages/C50/C50.pdf.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[10] R. K. Yadav, "Web Services for Automated Assessment of Reviews," MS thesis, North Carolina State University, Raleigh, 2016.