Can models learned from a dataset reflect acquisition of procedural knowledge? An experiment with automatic measurement of online review quality

Martina Megasari, Nicolas Labroche, Pandu Wicaksono, Patrick Marcel, Chiao Yun Li, Verónika Peralta, Clément Chaussade, Shibo Cheng
University of Tours, France
firstname.lastname@univ-tours.fr, firstname.lastname@etu.univ-tours.fr

ABSTRACT

Can models learned from a dataset reflect how good humans are at mastering a particular skill? This paper studies this question in the context of online review writing, where the skill corresponds to the procedural knowledge needed to write helpful reviews. To this end, we model the quality of a review by a combination of various metrics stemming from text analysis (like readability, polarity, spelling errors or length) and we use customer-declared helpfulness as a ground truth for constructing the model. We use Knowledge Tracing, a popular model of skill acquisition, to measure the evolution of the ability to write reviews of good quality over a period of time. While recent studies have tried to measure the quality of a review and correlate it to helpfulness, to the best of our knowledge, our work is the first to address this question as the exercise of a reviewer's skill over a sequence of reviews. Our experiments on a set of 41,681 Amazon book reviews show that it is possible to accurately assess the individual skill acquisition of writing a helpful review, based on a statistical model of the procedural knowledge at hand rather than human evaluations prone to subjectivity and variations over time.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

In today's era of big and open data, plenty of datasets are analyzed to derive models mimicking humans by using machine learning techniques. The representation and assessment of user knowledge opens new possibilities for big data analytics, such as differentiating between novice and expert users, taking advantage of user experience for recommending (e.g. products or actions), calculating advanced scores (e.g. credibility), assessing the quality of users' analyses, etc. In this paper we focus on the assessment of procedural knowledge from large data collections.

Procedural knowledge is the knowledge about how to do something. Different from declarative knowledge, which is often verbalized, the application of procedural knowledge may not be easily explained [3]. Models exist to evaluate procedural knowledge acquisition, like for instance the popular Bayesian Knowledge Tracing [6].

Many open datasets illustrate the application of procedural knowledge. For instance, Amazon review datasets like those provided by He and McAuley [12] contain customer-written reviews, where the skill of writing helpful reviews is an example of application of procedural knowledge. However, this skill is difficult to define and assess. Reviews can be voted helpful or not by customers, but this assessment is subjective and as such subject to variations over time, and it is difficult to construct a model that accurately predicts the helpfulness of a review [16].

In this paper, we show that it is possible to benefit from such very large datasets to learn an individual model of procedural knowledge acquisition. The resulting model of knowledge has several nice properties: (1) it is not prone to the usual bias caused by a single small set of evaluators that might be non-representative or produce a subjective evaluation, (2) it avoids defining explicitly the procedural knowledge at hand, which is instead replaced by a statistical model learned over the large dataset. As a consequence, the larger the dataset, the more accurate the modeling of the procedural knowledge, and the better the evaluation of the skill for a user.

To illustrate this, we experiment with a use case based on the aforementioned Amazon online product reviews. We chose this use case because it is prototypical of how procedural knowledge influences decision making. For instance, Mayzlin and Chevalier studied the effects of online book reviews on Amazon.com and Barnesandnoble.com and found a positive correlation between the reviews and the transactions of the book [4]. This means that reviewers' opinions play an important role in users' decisions on the transaction. Automatic measurement of the reviewer skill may be beneficial to predict how helpful the review is. A skillful writer is assumed to be able to write a good review, which can help the customer to make a better decision on the transaction.

To motivate our approach, suppose that we want to determine whether a reviewer can be assumed to master the skill of writing helpful reviews. This is preferable to trying to predict the helpfulness of the reviews, because of the high variability of reviewer profiles, reviews and votes received by reviews. However, this skill corresponds to procedural knowledge and is difficult to define. Therefore, to evaluate the skill of each reviewer, we use the classical Knowledge Tracing model. But instead of applying Knowledge Tracing directly over the votes received by reviews, we apply it over a model of helpfulness learned from each review. Our research question is: can this model of helpfulness be used to assess the skill accurately? Consider the four curves displayed in Figure 1. These curves are related to the evolution over time of the skill of writing helpful reviews for a particular reviewer (randomly extracted from the Amazon book review dataset). The helpfulness curve is the normalized score of helpfulness received by the 20 reviews written by this reviewer. The model curve is the helpfulness score as predicted for this reviewer by a model learned over the entire dataset. The KT helpfulness curve predicts the probability that this reviewer has acquired the skill of writing helpful reviews, computed with the helpfulness score. The KT model curve is the same probability computed with the model. On this example, it is obvious that even though the skill can be considered acquired, the helpfulness score is difficult to predict due to the subjectivity of the voters. On the other hand, a model of helpfulness can be learned to predict if the skill has been acquired.

Figure 1: Evolution of helpfulness for a reviewer and different models of it

The contributions of this paper are the following: (1) assuming that writing helpful reviews is a hard-to-define skill, we propose a model for it. We use low-level features of the online review, such as rating, spelling error ratio or readability score, to build the model that infers a high-level and human-related feature, which is helpfulness. This model is learned over the entire dataset and can be used to predict the helpfulness of future reviews for one particular reviewer. (2) Using Knowledge Tracing, we show that this model can be used to assess skill acquisition without relying on human-entered votes. In particular, we show that this model, although learned over the entire dataset, is accurate enough to predict if the skill is acquired by each individual reviewer. To the best of our knowledge, this work is the first to evaluate a reviewer's skill over a sequence of reviews with Knowledge Tracing.

The remainder of the paper is organized as follows. Section 2 discusses related works. Section 3 defines the features used to build the model of helpfulness. Section 4 details our approach. Section 5 explains how the experiment is performed to build the model and exposes the results. Finally, Section 6 concludes the paper and discusses some possible future work.

2 RELATED WORKS AND BACKGROUND

We first review recent works on online review evaluation and then describe the Bayesian Knowledge Tracing model and some of its extensions.

2.1 Online review evaluation

Readability tests play an important role in online review evaluation. Various indexes have been proposed to quantify the readability of an English text. Most of these indexes are related to the level of studies a person needs to understand the text at the first reading, according to the American standard. They are computed considering the number of words, number of sentences, number of syllables or number of characters as components. The Gunning-Fog Index (FOG) [10] aims to estimate the years of formal education a person needs to understand the text during the first reading. The Flesch Reading Ease (FK) [15] indicates the difficulty of a text using the number of words, number of sentences and number of syllables. Higher values indicate better readability. The Automated Readability Index (ARI) [23] measures the approximate representation of the US grade level needed to understand the text. The Coleman-Liau Index (CLI) [5] is an approximation of the US grade level needed to understand the text. More background on readability tests can be found in [16].

Previous works have studied the evaluation of online reviews due to the popularity of online marketing nowadays. Authors often pay attention to the influence of online reviews on helpfulness. Korfiatis et al. investigated the interplay between helpfulness, rating score and qualitative characteristics of the review text of 37,221 online reviews collected from Amazon UK during March to April 2008 [16]. The authors theorize that helpfulness relates to a model with three aspects: conformity (relation between the review text and the rating), understandability (readability of the review text) and expressiveness (length of the review text). The authors formulate several hypotheses and perform linear regression to validate the relationship between the metrics derived from reviews and the helpfulness of the reviews. Regarding understandability, four common readability scores, indicating the education level readers need in order to understand the content, are computed: FOG, FK, ARI and CLI. Their results indicate that the helpfulness of a review is directionally affected by its qualitative characteristics and in particular by review text readability. Precisely, the relationship between reviews with average length and their readability scores holds for both moderate and extreme reviews. In addition, readability has more impact on the length of the reviews. In their work, metrics related to polarity, the summary text of reviews and rating deviation (between the average rating and the reviewer's one) are not considered. Moreover, due to the purpose of the work, books having special offers are not considered, to avoid the price effect. In our work, such books are chosen due to the amount of reviews resulting from this price effect.

Based on 7,659 book reviews on Amazon UK, Wu et al. explored whether a negative bias exists when evaluating helpfulness [27]. The assumption was that negative reviews may be more helpful than positive ones. After applying a regression model controlling factors such as readability and length of the reviews, the results show that the assumption is not yet readily applicable to online reviews.

To the best of our knowledge, no work ever focused on the evolution of the quality of review text under the angle of skill acquisition, with a model learned only on the review content.
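To make the four readability indices discussed above concrete, the sketch below computes FOG, FK, ARI and CLI from simple text statistics. It is a minimal illustration: the formulas follow the variants given in Section 3 (equations (2)-(5)), and the function and parameter names are our own illustrative choices, not part of any library.

```python
def fog(nb_words, nb_sentences, nb_complex_words):
    """Gunning-Fog Index: estimated years of formal education (eq. (2))."""
    return 0.4 * ((nb_words / nb_sentences)
                  + 100 * (nb_complex_words / nb_words))

def fk(nb_words, nb_sentences, nb_syllables):
    """Flesch Reading Ease: higher values mean better readability (eq. (3))."""
    return (206.835 - 1.015 * (nb_words / nb_sentences)
            - 84.6 * (nb_syllables / nb_words))

def ari(nb_characters, nb_words, nb_sentences):
    """Automated Readability Index: approximate US grade level (eq. (4))."""
    return (4.71 * (nb_characters / nb_words)
            + 0.5 * (nb_words / nb_sentences) - 21.43)

def cli(nb_characters, nb_words, nb_sentences):
    """Coleman-Liau Index: approximate US grade level (eq. (5))."""
    return (5.89 * (nb_characters / nb_words)
            - 0.3 * (nb_sentences / nb_words) - 15.8)
```

For example, a 100-word, 5-sentence text with 130 syllables gives fk(100, 5, 130) of about 76.6, i.e. a fairly easy text, while the same counts with 10 complex words give fog(100, 5, 10) of exactly 12 years of education.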
Mudambi and Schuff analyzed 1,587 reviews from Amazon.com [19] to understand how review extremity, review depth and product type affect the perceived helpfulness of the review. Their helpfulness model is based on the features rating, review text word count, total votes and product type. Product type is either Experience goods or Search goods, where Experience goods are products that require sampling or purchase in order to evaluate product quality. Books are examples of experience goods. They found that for experience goods, moderate reviews are more helpful than extreme reviews (whether they are strongly positive or negative). In contrast, it has been observed that reviews closer to the general opinion of people (average rating score) may be considered more helpful by the potential buyers [14].

McAuley and Leskovec [18] propose a latent-factor model for recommending products that may be preferred by the users according to their experience level at the moment. The model evaluates the evolution of users' experiences and is based on the rating that users give to products. Unlike other works on temporal dynamics, which rely on the hypothesis that two users rating a product at the same time will provide the same rating, McAuley and Leskovec's model takes users' personal development into consideration in order to evaluate the expertise degree of the reviewers. Experiments showed for example that experts' ratings are easier to predict and are more similar to each other. While close to our work in the idea of taking the evolution of the user into account, this work focuses on ratings and not helpfulness, and therefore does not consider the linguistic aspect of review text.

Liu et al. considered a complex model learned using non-linear regression, that combines the reviewer's expertise (based on the number of similar reviews written in the past), the writing style of the review (characterized with part-of-speech tagging and counting the number of words in each tag), and the timeliness of the review [17]. They showed that the three factors accurately predict helpfulness, over a dataset of 22,819 reviews collected from IMDB.

In [26], review helpfulness is considered through five features including user profile aspects (age, verified purchase) together with rating, text length and the rank of the review in the webpage. A model learned on 12,756 reviews was shown to be reasonably robust.

Agnihotri and Bhattacharya explored how the helpfulness of online reviews is affected by content readability (FK Index), sentiment analysis and the number of reviews written by a reviewer [1]. It was observed on 1,608 Amazon reviews that the content readability and text sentiment of the reviews follow a curvilinear relationship with review helpfulness. Reviews whose readability score is very high or whose sentiment is very positive would be perceived as less helpful.

Hong and Xu analyze the impact of the review message and reviewer profile on the helpfulness of 2,997 online reviews collected from Douban.com [13]. Using negative binomial regression, the authors show that reader participation is positively related to online review helpfulness; reader participation fully mediates the effect of reviewer expertise history on online review helpfulness and partially mediates the effects of three other metrics: average rating, title depth and reviewer network centrality.

2.2 Knowledge Tracing Models

The Bayesian Knowledge Tracing model was proposed by Corbett and Anderson, using a Bayesian network to assess people's procedural knowledge acquisition or, simply put, "skill level" [6]. An individual's grasp of the procedural knowledge is expressed as a binary variable expressing whether the corresponding skill has been mastered or not. The knowledge of an individual cannot be directly observed, but it can be induced by observing the individual's answers to a series of questions (or opportunities to exercise the skill) in order to guess the probability distribution of knowledge mastering. Observation variables are also binary: the answer to the question is either correct or wrong.

Specifically, the Knowledge Tracing model has four parameters, namely two learning parameters, P(L0) and P(T), and two performance parameters, P(G) and P(S). P(L0) is the probability that the skill has been mastered before answering the questions. P(T) is the knowledge transformation probability: the probability that the skill will be learned at each opportunity to use the skill (i.e., the transition from not mastered to mastered). P(G) is the probability of guess: in the case of knowledge not mastered, the probability that the individual can still answer correctly. P(S) is the probability of slip, i.e. to fail while the skill is already mastered. The model uses these parameters to calculate the learning probability after each question to monitor the individual's knowledge status and predict their future learning probability of knowledge acquisition using a Bayesian network.

The probability that a skill L at question i+1 is mastered, denoted P(Li+1), is the sum of two probabilities: (1) the posterior probability that the skill was already learned, contingent on the evidence at time i, i.e. the i-th opportunity to evaluate the skill, which can either be Correct or Incorrect, and (2) the probability that the knowledge changes from not mastered to mastered at the i-th opportunity. This is captured by the following formula:

    P(Li+1) = P(Li | Evidencei) + (1 − P(Li | Evidencei)) ∗ P(T)    (1)

where:

    P(Li | Evidencei = Correct) = P(Li) ∗ P(¬S) / (P(Li) ∗ P(¬S) + P(¬Li) ∗ P(G))

    P(Li | Evidencei = Incorrect) = P(Li) ∗ P(S) / (P(Li) ∗ P(S) + P(¬Li) ∗ P(¬G))

Due to its predictive accuracy, Corbett and Anderson's Bayesian Knowledge Tracing is one of the most popular models. However, several challenges, including local minima, degenerate parameters and computational costs during fitting, still exist. Hawkins et al. proposed a fitting method avoiding these problems while achieving a similar predictive accuracy, and evaluated it against one of the most popular fitting methods: Expectation-Maximization [11]. In this extension, the parameters are fitted by estimating the most likely opportunity at which each individual learned the skill. The learner's performance is thus annotated with an estimate of when the skill is learned, assuming that a known state can never be followed by an unknown state. This annotation is used to construct knowledge sequences that, when compared with the actual performance sequence, allow to empirically derive the model's four parameters.

As aforementioned, traditionally, the performance of an individual is represented as a binary value, correct or wrong, which does not account for all skill-learning situations. Wang et al. proposed to extend the Knowledge Tracing model by replacing the discrete binary performance node with a continuous partial credit node [25]. In this extension, it is assumed that P(G) and P(S) follow two Gaussian distributions, described respectively by their means and standard deviations. The prediction of the performance node also follows a Gaussian distribution, in which the mean value is used for the prediction. Noticeably, the standard deviation contains the information of how good the prediction is. Experiments with this extension show that by relaxing the assumption of binary correctness, the predictions of an individual's performance can be improved.

These two improvements of the Knowledge Tracing model (in the fitting method and the use of partial credits) were used successfully in sequencing educational content to students [7]. We conclude this section by noting that other models exist for predicting a learner's skill. Specifically, Performance Factor Analysis [20] uses standard logistic regression with the student performance as dependent variable. Interestingly, it is shown in [9] that Knowledge Tracing can achieve comparable predictive accuracy to Performance Factor Analysis. Finally, Deep Knowledge Tracing [22] uses Recurrent Neural Networks to model student learning, with the advantage of not having to set explicit probabilities for slip and guess. However, these models need very large datasets to learn the latent state from sequences, and most importantly, the encoding of the input vectors depends on an upper bound on the number of exercises, which does not directly fit our context.

3 FEATURES AND METRICS

Consistently with the previous work of Korfiatis et al. [16], our model of helpfulness is based on features that are grouped in three categories: Conformity, Understandability and Extensiveness, with additional features compared to [16]. We derive metrics, i.e., numerical attributes to be used in the definition of our model, from these features. Conformity expresses the consistency of a review being written. In addition to the classical rating, we add two metrics in this category: Polarity and Deviation. Understandability measures how good the quality of the written text is in terms of readability. We derived five metrics to measure the score: Spelling Error Ratio and 4 readability metrics (FOG, FK, ARI, and CLI). Finally, Extensiveness refers to the length of the review. In total, 16 metrics are defined, since length and readability metrics apply both to the review text and the summary. We detail them below; a summary of the features used in the experiments with their name, category, theoretical and empirical range is provided in Tables 1 and 2.

    Feature name        Category       Applies to  Range
    rating              Conformity     all         [1, 5]
    polarityReviewText  Conformity     text        [-1, 1]
    polaritySummary     Conformity     summary     [-1, 1]
    deviation           Conformity     all         [0, 5]
    reviewTextSER       Readability    text        [0, 1]
    summarySER          Readability    summary     [0, 1]
    reviewTextFOG       Readability    text        R+
    summaryFOG          Readability    summary     R+
    reviewTextFK        Readability    text        R
    summaryFK           Readability    summary     R
    reviewTextARI       Readability    text        R
    summaryARI          Readability    summary     R
    reviewTextCLI       Readability    text        R
    summaryCLI          Readability    summary     R
    reviewTextLength    Extensiveness  text        N+
    summaryLength       Extensiveness  summary     N+

Table 1: Summary of the main features

    Feature name        Min        Max      Mean      Std Dev.
    rating              1          5        4.112     1.183
    polarityReviewText  -0.875     0.875    0.027     0.052
    polaritySummary     -0.875     1        0.029     0.137
    deviation           0          3.786    0.452     0.615
    reviewTextSER       0          0.5      0.009     0.008
    summarySER          0          1        0.014     0.038
    reviewTextFOG       0          740.8    13.983    8.45
    summaryFOG          0          42.4     9.524     10.038
    reviewTextFK        -1788.235  121.22   58.96     24.407
    summaryFK           -1824.58   121.728  59.537    51.228
    reviewTextARI       -6.837     919.088  11.41     10.374
    summaryARI          -16.22     261.67   5.162     7.769
    reviewTextCLI       -22.24     39.133   8.64      2.549
    summaryCLI          -58.13     307.6    5.387     9.417
    reviewTextLength    0          32669    1152.094  1261.787
    summaryLength       1          257      28.875    16.786

Table 2: Empirical values of the metrics

3.1 Conformity

Metrics in this category relate to the consistency of the review. As the content of a review consists in a rating and a written text, we can derive a relation between them. A rating should correspond to the written review and vice versa; hence a difference between these two contents might indicate that the review is inconsistent. For example, a review having a 5-star rating and very negatively written text is inconsistent. Needless to say, inconsistent reviews may lead to a lower helpfulness score due to the confusion they bring. From this perspective, we consider the Polarity of the text, which indicates the positiveness or negativeness of a review, as a metric. Besides, the extremity of the rating given by the reviewer may indicate that the reviewer is biased and has a subjective point of view on the product being reviewed. Extremely high and low ratings are associated with lower levels of helpfulness than moderate ratings [19]. In contrast, reviews closer to the general opinion of people (average rating score) may be considered more helpful by the potential buyers [14]. From this perspective, we derived the Deviation score, quantifying how different the rating given by the reviewer is from the average rating.

Rating. The Rating of a review is the user-input quantitative indicator of the quality of the item reviewed (e.g., rating is from 1 to 5 for Amazon Book Reviews).

Polarity. The Polarity of a text is measured by using a word list that indicates the positivity, negativity and objectivity of each synset. The polarity score of a word with its part of speech is calculated as the score of positivity minus the score of negativity. The value of polarity ranges between -1 and 1: -1 indicates that the written text is very negative and 1 indicates that the written text is very positive.

Deviation. Deviation is calculated as the absolute difference between the rating of a review and the average rating of the item reviewed.

3.2 Readability

Metrics in this category relate to the effort needed to understand the text of the review. This is measured based on the number of spelling errors in the written text, which is expected to be negatively correlated to helpfulness [8], and with various readability measures.

Spelling Error Ratio (SER). The Spelling Error Ratio is the number of spelling errors divided by the text length.

Gunning-Fog Index (FOG). The FOG [10] aims to estimate the years of formal education (according to the American system) a person needs to understand the text during the first reading. This index uses the number of words, the number of sentences and the number of complex words to measure the years. A word is considered a complex word if it has more than two syllables.

    FOG = 0.4 ∗ [(nbWords / nbSentences) + 100 ∗ (nbComplexWords / nbWords)]    (2)

Flesch Reading Ease (FK). The FK index [15] indicates the difficulty of a text using the number of words, number of sentences and number of syllables.

    FK = 206.835 − 1.015 ∗ (nbWords / nbSentences) − 84.6 ∗ (nbSyllables / nbWords)    (3)

Automated Readability Index (ARI). The ARI [23] approximates the US grade level needed to understand the text. This index uses the number of characters, number of words and number of sentences.

    ARI = 4.71 ∗ (nbCharacters / nbWords) + 0.5 ∗ (nbWords / nbSentences) − 21.43    (4)

Coleman-Liau Index (CLI). The CLI [5], like ARI, is an approximation of the US grade level needed to understand the text. This index also uses the number of characters, number of words and number of sentences as components.

    CLI = 5.89 ∗ (nbCharacters / nbWords) − 0.3 ∗ (nbSentences / nbWords) − 15.8    (5)

3.3 Extensiveness

The textual part of the review consists of a text and a summary of this text. For both we measure the length in characters, respectively called Review Text Length and Summary Length.

4 METHODOLOGY

Our approach is divided into three phases: metric extraction, model construction and skill evaluation. These phases are detailed below.

4.1 Metric extraction and feature selection

In the first phase, we calculate for each review the scores for the metrics presented in Section 3, which we use to build the model of helpfulness. Then we apply feature selection to reduce the set of metrics by removing redundant ones, while avoiding losing too much information on the dataset. We use a heuristic greedy method by calculating all the pairwise correlations between metrics. For those metrics that are highly correlated, only the ones highly correlated with the helpfulness score will be kept, the others being discarded. Finally, we normalize the scores in order to be independent of attribute ranges and units and highlight the actual importance of each attribute. We use the Min-Max Scaling normalization strategy.

4.2 Model construction

We build our model to measure the quality of a review, where quality is defined by the helpfulness ratio of the review:

    helpfulness = nbHelpfulVotes / nbVotes    (6)

where nbHelpfulVotes is the number of positive votes received by the review and nbVotes is the total number of votes received by the review. This constitutes the class attribute value of a supervised machine learning method to build our simple model of helpfulness as a linear combination of the metrics. Thus, our predicted output variable y ∈ R will be expressed as a weighted sum of input features xi, ∀i ∈ [1, m], m being the number of features:

    y = Σ_{i=1..m} ωi ∗ xi + b    (7)

where ωi ∈ R is the weight reflecting the contribution of feature i to the overall decision and b ∈ R stands for the bias.

The intuition behind restricting our study to linear models is twofold. First, these models are simpler and can be calculated more efficiently. Second, they allow for a direct interpretation of the contribution of each feature to the final helpfulness decision. To this end, we try a variety of methods and keep the one best fitting the dataset. In our tests, error measurement is done using the classical correlation coefficient, Efron's R2, MAE and RMSE scores.

4.3 Skill evaluation

In this last phase, we apply Knowledge Tracing (KT) to sequences of reviews in order to estimate reviewers' skills. We proceed as follows: we group the reviews by reviewer, obtaining one sequence of reviews per reviewer. Each review is considered as an opportunity to learn the skill (i.e. being able to write useful reviews) and is graded with a score representing the reviewer's performance (i.e. how useful the review is). We compute two KT scores: (i) directly from helpfulness ratings, and (ii) from the learned helpfulness model. In the former, the reviewer's performance is calculated as the helpfulness score of the review. In the latter, it is predicted by the helpfulness model. In both cases, the final score, output by the KT model, expresses the probability that the skill is mastered by the reviewer.

We use the continuous version of KT described in [25] since the scores we consider are continuous. In this extension of KT, P(G) and P(S) are assumed to follow a Gaussian distribution, and as such, they are represented by a mean value and a standard deviation. As a consequence, and as opposed to binary KT, the prediction P(Ln) also follows a Gaussian distribution, whose mean expresses the value of the prediction and whose standard deviation expresses the confidence attached to this prediction. To learn the 6 parameters of continuous KT, we extend the approach proposed by Hawkins et al. [11] so that it outputs estimates of P(G) and P(S) described by a mean and a standard deviation. Then, based on these 6 parameters, the estimation of each skill acquisition P(Ln) is performed by running 100 tests, with randomly generated values for P(G) and P(S) following their respective distributions. From these 100 P(Ln) estimates, we compute a mean and a standard deviation following the normal hypothesis.
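As an illustration of the Knowledge Tracing update of Section 2.2, the sketch below implements one binary KT step, i.e. equation (1) together with the two evidence cases. It is a minimal illustration with our own function and parameter names; the continuous, Gaussian extension of [25] used in this paper would replace the binary observation with a partial-credit score.

```python
def kt_update(p_l, observation_correct, p_t, p_g, p_s):
    """One binary Knowledge Tracing step (Corbett & Anderson).

    p_l: current probability P(Li) that the skill is mastered.
    p_t, p_g, p_s: transition, guess and slip probabilities.
    Returns P(Li+1) after observing one opportunity.
    """
    if observation_correct:
        # P(Li | Evidencei = Correct)
        posterior = (p_l * (1 - p_s)) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        # P(Li | Evidencei = Incorrect)
        posterior = (p_l * p_s) / (p_l * p_s + (1 - p_l) * (1 - p_g))
    # Equation (1): learning transition with probability P(T)
    return posterior + (1 - posterior) * p_t

def trace(observations, p_l0, p_t, p_g, p_s):
    """Run KT over a sequence of correct/incorrect observations."""
    p_l, history = p_l0, []
    for obs in observations:
        p_l = kt_update(p_l, obs, p_t, p_g, p_s)
        history.append(p_l)
    return history
```

For instance, with P(L0)=0.3, P(T)=0.2, P(G)=0.2 and P(S)=0.1, a single correct answer raises the mastery estimate from 0.30 to about 0.73.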
However, the KT efficiency is known to depend on the granularity of the skills that are fed to the model: generally, the more focused the skills, the better the prediction of skill acquisition. In this respect, each of the features that feed our linear predictive model of helpfulness can be considered as a sub-skill related to helpfulness. For this reason, we define two distinct tests to evaluate the learned model of helpfulness. In the first, we simply use the output of the linear regression model as the predicted helpfulness for a review. In the second, we consider each feature metric as a possible sub-skill evaluation of the reviewer. We then learn as many KT models as there are features. In the end, we have the probabilities that the sub-skills corresponding to each feature are acquired. These sub-skill scores are then aggregated into one single skill acquisition probability.

The global validation of our proposal is given by measuring the error between the KT based on real ratings, the KT based on the general linear model, and the KT based on aggregated feature-based models. This error is evaluated by RMSE, which has been shown to be the strongest performance indicator for binary KT, with significantly higher correlation than Log Likelihood and Area Under Curve [21].

5 EXPERIMENTS

Our implementation is done in Java 8, with Weka 3.8 for model learning. We used our own implementation of knowledge tracing, whose code has been made available through GitHub1 as one contribution of this paper. For polarity extraction, we use SentiWordNet [2], which lists the positivity, negativity and objectivity of each synset (set of synonyms). SentiWordNet provides the score of each word together with its part of speech, hence we POS-tag each word using the Stanford POS tagging library [24].

5.1 Dataset description

The dataset we use for experiments is the Amazon Book Review Data provided by Julian McAuley from UCSD [12]. We select the book category in this dataset, resulting in 22,507,155 total reviews. As one of our goals is to measure the evolution of the ability to write reviews of good quality, we need to obtain for each reviewer a sequence of reviews long enough to observe that evolution. Therefore, we consider reviewers with fewer than 30 reviews as not active enough and filter them out. In addition, we only consider the reviews that have been scored by customers by means of votes (helpful review or not).

To confirm the hypothesis that few reviewers have written many reviews and that many reviewers have written few reviews, we plotted on Figure 2 the number of reviewers (on a logarithmic scale) by number of reviews, for reviewers with more than 30 reviews. Each point (x, y) in this figure indicates that x reviewers have written y reviews. Furthermore, we found reviewers writing so many reviews that they appear dubious and possibly bias their reviews. For instance, the reviewer with ID A14OJS0VWMOSWO wrote 43,201 reviews with an average score of 4.9991 out of 5. This reviewer received 240,262 votes, of which 199,573 are helpful. In our opinion, such reviewers introduce a bias in the dataset. Hence we limited our experiment to reviewers that have 30 to 50 reviews.

We calculate the score of each feature from the dataset and report their standard deviations in the last column of Table 2. The standard deviation of the helpfulness, which varies in [0,1], is 0.32, which indicates that the score is quite spread out and that the dataset has a wide enough variety, from helpful reviews to not helpful reviews. Moreover, the standard deviations of the features indicate that creating a model from this dataset is difficult.

5.2 Model construction

We now describe how the model of helpfulness is learned from the dataset. Consistently with [16], our model of helpfulness is constructed as a linear combination of the metrics extracted from the review text and summary. More precisely, as explained in Section 4, we use a linear classifier to learn a weight for each of the features introduced in the previous section, in order to understand its contribution to the helpfulness score. We tested three different approaches to learn the features' weights: Linear Regression, Perceptron and Support Vector Machine with a linear kernel. We used off-the-shelf Weka algorithms with 10-fold cross validation. Table 5 summarizes the results of those tests, for various dataset sizes selected according to the minimum number of votes per review (from 918 reviews with at least 200 votes, up to 522,804 reviews with at least 1 vote). Results for Perceptron and SVM are not reported for the largest dataset due to excessive computation time. The results show that linear regression achieves a good compromise between accuracy and computation time, with better accuracy on smaller datasets and better handling of larger datasets with no significant drop in accuracy. We therefore chose to work with linear regression in what follows.

5.2.1 Preprocessing. We recall that our definition of helpfulness is the number of helpful votes divided by the total number of votes; hence, a review with a large number of votes is a genuine representation of helpfulness from a customer's point of view. But a review with only one vote, being a helpful one, can still obtain a maximum helpfulness score, which is not desirable. Filtering the dataset by number of votes thus becomes necessary. In order to find the appropriate minimum number of votes per review, we iterated this parameter from 1 to 25 for the most important features of our model (i.e., after feature selection), and checked the results in terms of correlation and expressiveness (contribution of each metric), reported in Table 3. We decided to choose 2 datasets among those tested, based on, first, expressiveness (determined by non-zero coefficient values in the linear model), and second, correlation coefficient (which indicates to what extent the model matches the dataset), for more than 10,000 reviews. The best expressiveness and correlation coefficients were obtained for at least 12 votes and at least 23 votes, respectively. At this stage, we are not sure about the effect of these parameters on the knowledge tracing model; we therefore keep both data sets, to see which gives a better result in knowledge tracing. In what follows, the first dataset is called minVotes = 12 and consists of 41,681 reviews, while the second dataset is called minVotes = 23 and consists of 11,083 reviews.

Using linear regression on the two datasets minVotes = 12 and minVotes = 23 results in the models described in Tables 6 and 7 respectively. The models constructed are evaluated with correlation coefficient, Efron's R², MAE and RMSE scores, reported in Table 8.

5.2.2 Feature selection impact. We then proceed to feature selection, as described in Section 4.1. As shown in Table 8, our models before and after feature selection achieve very similar accuracy results.
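The greedy, correlation-based selection heuristic of Section 4.1 can be sketched in a few lines of Python (the paper's actual pipeline is Java/Weka). The representation of features as aligned numeric lists and the redundancy threshold of 0.9 are illustrative assumptions of this sketch, not values given in the paper:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def select_features(features, helpfulness, threshold=0.9):
    """Greedy selection: among each pair of mutually redundant features
    (|correlation| above threshold), drop the one that is less correlated
    with the helpfulness target; keep everything else."""
    kept = set(features)
    names = sorted(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(features[a], features[b])) > threshold:
                r_a = abs(pearson(features[a], helpfulness))
                r_b = abs(pearson(features[b], helpfulness))
                kept.discard(a if r_a < r_b else b)
    return sorted(kept)
```

On a toy input with a duplicated length feature, exactly one copy of the redundant pair survives while uncorrelated features are untouched, which is the behavior the heuristic above describes.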
If efficiency in learning the model is an issue, or if the model should remain as simple as possible, one can then safely decide to use the model learned on only the selected features. In what follows, we report the results for both sets of features.

A second lesson learned from our feature selection step is that, interestingly, for both datasets, the selected features include features that were not present in [16], namely the spelling error ratio, polarity and deviation. With the notable exception of the summary spelling error ratio, these features' weights remain steady, and in some cases relatively important, after feature selection. Quite surprisingly, reviewTextSER has no impact on helpfulness, while, as expected, deviation contributes highly negatively to it.

Figure 2: Number of reviews by number of reviewers

minVotes | Number of reviewers | Number of reviews | Correlation coefficient | Number of zero coefficients
1 | 13820 | 522801 | 0.3352 | 0
2 | 13556 | 350158 | 0.3917 | 0
3 | 11312 | 247394 | 0.4351 | 0
4 | 9060 | 184092 | 0.4649 | 0
5 | 7408 | 142572 | 0.4946 | 0
6 | 6304 | 115295 | 0.5174 | 0
7 | 5349 | 94482 | 0.5373 | 0
8 | 4586 | 78416 | 0.5544 | 0
9 | 3972 | 66216 | 0.5716 | 0
10 | 3453 | 56446 | 0.5834 | 0
11 | 3004 | 48278 | 0.5975 | 1
12 | 2643 | 41681 | 0.6065 | 0
13 | 2320 | 36058 | 0.6154 | 2
14 | 2041 | 31385 | 0.6245 | 2
15 | 1822 | 27596 | 0.6355 | 2
16 | 1616 | 24270 | 0.6415 | 2
17 | 1451 | 21519 | 0.6465 | 2
18 | 1302 | 19131 | 0.6551 | 2
19 | 1192 | 17274 | 0.657 | 2
20 | 1079 | 15489 | 0.6625 | 2
21 | 980 | 13899 | 0.6698 | 2
22 | 886 | 12451 | 0.6772 | 2
23 | 793 | 11083 | 0.6804 | 2
24 | 714 | 9940 | 0.684 | 2
25 | 655 | 9069 | 0.6897 | 2
Table 3: Correlation Coefficient for various minVotes

Metric | minVotes = 12 | minVotes = 23
rating | 0.31117594 | 0.37121056
polarityReviewText | 0.36708846 | 0.27873667
polaritySummary | 0.05166703 | 0.08006764
deviation | -0.20847153 | -0.1951008
reviewTextSER | 0 | 0
summarySER | -0.28603436 | -0.25242002
reviewTextFOG | -1.10263702 | -0.40678142
summaryFOG | 0 | 0.02506183
reviewTextFK | 4.37638627 | 2.3020136
summaryFK | 0.12251469 | 0.13780373
reviewTextARI | 5.01873535 | 2.47411051
summaryARI | -0.4099729 | -0.10677126
reviewTextCLI | 0.31215745 | 0.21371702
summaryCLI | 0.79694206 | 0.28470061
reviewTextLength | 0.30807426 | 0.3431656
summaryLength | 0 | 0.03837902
bias | -4.26391009 | -2.21159487
Table 4: Coefficients of the Linear Regression Model for minVotes = 12 and minVotes = 23

Algorithm | Dataset size | Exec. time | Correlation coefficient | RMSE score
Linear Regression | 918 | 0.01 | 0.6455 | 0.2005
Perceptron | 918 | 0.12 | 0.5071 | 0.2635
SVM | 918 | 0.25 | 0.6352 | 0.218
Linear Regression | 3414 | 0.02 | 0.7226 | 0.1957
Perceptron | 3414 | 0.44 | 0.5135 | 0.2569
SVM | 3414 | 6.39 | 0.7199 | 0.1992
Linear Regression | 10971 | 0.02 | 0.6888 | 0.2023
Perceptron | 10971 | 1.39 | 0.5349 | 0.2658
SVM | 10971 | 101.87 | 0.6846 | 0.2062
Linear Regression | 29808 | 0.04 | 0.6303 | 0.2064
Perceptron | 29808 | 3.81 | 0.499 | 0.2401
SVM | 29808 | 829.65 | 0.627 | 0.2119
Linear Regression | 522801 | 0.67 | 0.3352 | 0.3028
Table 5: Test of 3 linear model algorithms on various datasets

5.2.3 Comparison with the state-of-the-art. As to model accuracy, Table 8 shows that the results we obtained are notably comparable, and in some cases slightly better, than those reported in [16] on datasets of similar size (37,221 Amazon UK reviews were analyzed in that work). In that work, 3 models were constructed, and their fitness to the dataset was reported in terms of Efron's R² scores. Their three models obtained respectively 0.316, 0.354 and 0.451, while ours scores 0.3697 for minVotes = 12 and 0.4651 for minVotes = 23 (the higher the better for Efron's R²). Importantly, their models incorporate the features number of votes and number of helpful votes, which we have deliberately not included in ours, since we aim at predicting helpfulness when no such scores are available.

Finally, the two datasets minVotes = 12 and minVotes = 23 achieve comparable MAE and RMSE, even though minVotes = 23 shows a better correlation coefficient and Efron's R². This illustrates the robustness of our model construction approach to larger but more skewed datasets.

Metric | Before | After
rating | 0.31117594 | 0.31312877
polarityReviewText | 0.36708846 | 0.3655654
polaritySummary | 0.05166703 | 0.05351795
deviation | -0.20847153 | -0.20951913
reviewTextSER | 0 | -0.03361242
summarySER | -0.28603436 | -0.31027976
reviewTextFOG | -1.10263702 | N.A.
summaryFOG | 0 | -0.04014441
reviewTextFK | 4.37638627 | 0.4228708
summaryFK | 0.12251469 | N.A.
reviewTextARI | 5.01873535 | N.A.
summaryARI | -0.4099729 | N.A.
reviewTextCLI | 0.31215745 | 0.04970302
summaryCLI | 0.79694206 | 0.40990694
reviewTextLength | 0.30807426 | 0.3077809
summaryLength | 0 | 0.03922442
bias | -4.26391009 | -0.12802418
Table 6: Models of helpfulness before and after feature selection for minVotes = 12

Metric | Before | After
rating | 0.37121056 | 0.37369313
polarityReviewText | 0.27873667 | 0.28253483
polaritySummary | 0.08006764 | 0.0821465
deviation | -0.1951008 | -0.19656865
reviewTextSER | 0 | 0
summarySER | -0.25242002 | -0.29930955
reviewTextFOG | -0.40678142 | N.A.
summaryFOG | 0.02506183 | -0.021072
reviewTextFK | 2.3020136 | 0.13767929
summaryFK | 0.13780373 | N.A.
reviewTextARI | 2.47411051 | N.A.
summaryARI | -0.10677126 | N.A.
reviewTextCLI | 0.21371702 | 0
summaryCLI | 0.28470061 | 0.13525824
reviewTextLength | 0.3431656 | 0.34368253
summaryLength | 0.03837902 | 0.07231388
bias | -2.21159487 | 0.13145667
Table 7: Models of helpfulness before and after feature selection for minVotes = 23

Evaluation Metrics | minVotes = 12 | minVotes = 23
Total Number of Reviews | 41681 | 11083
|Before feature selection|
Correlation Coefficient | 0.608 | 0.682
Efron's R² | 0.3697 | 0.4651
Mean Absolute Error | 0.1521 | 0.1494
Root Mean Squared Error | 0.2014 | 0.201
|After feature selection|
Correlation Coefficient | 0.6065 | 0.6804
Efron's R² | 0.3678 | 0.4629
Mean Absolute Error | 0.1526 | 0.15
Root Mean Squared Error | 0.2017 | 0.2014
Table 8: Evaluation of the models

5.3 Skill evaluation

In this section, we show that the model obtained can be used to accurately predict the learning of the skill of writing helpful reviews.

After training the Knowledge Tracing (KT) model as explained in Section 4.3 using 10-fold cross validation, we obtain the average of the six parameters and the KT model RMSE scores. We also learn one KT model per sub-skill and aggregate them to obtain a single probability, as explained in Section 4.3. To be consistent with the learning of the linear regression model, this aggregation is done with the weights learned for that model. The results are reported in Tables 9 and 10. Each table shows the average skill acquisition probability (mean(Ln)) for the actual helpfulness skill, the helpfulness model and the aggregation of the sub-skills. We also report the parameters learned for the KT of the model.

For the sake of readability, we recall that RMSE scores are generated in three ways:
• RMSE, as reported in Table 8, represents the error between the helpfulness model scores and the actual helpfulness scores, without KT involved at that point.
• actual-model Knowledge RMSE (a-mKRMSE) represents the error between the KT of the actual helpfulness scores and the KT of the helpfulness as computed with the model.
• actual-Aggregated Knowledge RMSE (a-AggKRMSE) represents the error between the KT of the actual helpfulness scores and the aggregation of the KT scores of each feature taken independently (i.e., each sub-skill).

Scores | minVotes = 12 | minVotes = 23
Actual skill: mean(Ln) | 0.968337 | 0.960511
Actual skill: variation(Ln) | 0.025213 | 0.033238
Model: P(L0) | 0.007504 | 0.033457
Model: P(T) | 0.030262 | 0.077669
Model: mean(P(G)) | 0.349992 | 0.369982
Model: variation(P(G)) | 0.007067 | 0.0147
Model: mean(P(S)) | 0.412574 | 0.412882
Model: variation(P(S)) | 0.016212 | 0.025905
Model: mean(Ln) | 0.783885 | 0.800687
Model: variation(Ln) | 0.090915 | 0.090820
Aggregated: mean(Ln) | 0.999943 | 0.999991
Aggregated: variation(Ln) | 0.000584 | 0.000122
a-mKRMSE | 0.164619 | 0.156373
a-AggKRMSE | 0.064818 | 0.081964
Table 9: KT parameters, prediction and predictive accuracy before feature selection

Scores | minVotes = 12 | minVotes = 23
Actual skill: mean(Ln) | 0.968026 | 0.961189
Actual skill: variation(Ln) | 0.02536 | 0.032609
Model: P(L0) | 0.004666 | 0.030936
Model: P(T) | 0.026656 | 0.075609
Model: mean(P(G)) | 0.348688 | 0.369544
Model: variation(P(G)) | 0.007853 | 0.015195
Model: mean(P(S)) | 0.414712 | 0.406663
Model: variation(P(S)) | 0.014588 | 0.027189
Model: mean(Ln) | 0.774606 | 0.804482
Model: variation(Ln) | 0.090247 | 0.093722
Aggregated: mean(Ln) | 0.96117 | 0.900541
Aggregated: variation(Ln) | 0.048756 | 0.055234
a-mKRMSE | 0.170865 | 0.156404
a-AggKRMSE | 0.062204 | 0.050704
Table 10: KT parameters, prediction and predictive accuracy after feature selection

Before commenting on the results of the tests, it is important to note that the average value of the helpfulness skill acquisition probability (i.e., the value to be predicted) is high. We conjecture that this is due to the importance of the filtering, in terms of number of reviews per reviewer and number of votes, applied over the dataset.

5.3.1 Accuracy of the two KT models. The key observation is that switching to KT achieves very good to excellent RMSE scores, whatever the dataset considered. Notably, predicting the skill of writing helpful reviews is done much more accurately than predicting helpfulness. This allows us to answer positively the question expressed at the beginning of this paper: a model constructed on a large dataset can be used to assess procedural knowledge acquisition. Interestingly, predicting each sub-skill (corresponding to each feature) and combining these predictions to infer the global skill of writing helpful reviews is significantly better than predicting the skill at the coarse level of the model. In our tests, this combination was naively done with the weights learned by the linear regression algorithm (normalized, bias included) to build the model of helpfulness. Determining more sophisticated weight combinations is left for future work.

5.3.2 Comparison with random sequences of helpfulness scores. The small RMSE indicates that the KT model is good at predicting the learning of the writing skill of the reviewers. However, in order to validate the hypothesis that these good results do not come from an intrinsic smoothing behavior of the KT model, we ran the model on random sequences of helpfulness scores. To this end, we generated as many sequences as the original data set contains, and faked the helpfulness scores with random numbers generated between 0 and 1. The results, reported in Table 11, confirm that for both datasets the RMSE values are bad: for a random sequence of numbers as helpfulness scores, the model fails to predict the skill of the reviewers (which in this case is expectedly close to 0.5).

Scores | minVotes = 12 | minVotes = 23
P(L0) | 0 | 0
P(T) | 0.014637 | 0.016329
mean(P(G)) | 0.346532 | 0.345426
variation(P(G)) | 0.007084 | 0.014245
mean(P(S)) | 0.409464 | 0.409721
variation(P(S)) | 0.016246 | 0.029494
mean(Ln) | 0.518769 | 0.518702
variation(Ln) | 0.110546 | 0.106100
a-mKRMSE | 0.673532 | 0.669856
Table 11: KT parameters, prediction and predictive accuracy for random sequences of helpfulness

6 CONCLUSION

In this paper, we experimented with a large dataset of Amazon book reviews to show that a model of review helpfulness can be used to assess the acquisition of the skill of writing helpful reviews. Learning such an individual model of procedural knowledge acquisition has the advantages of being less prone to human variation and subjectivity (e.g., in judging the helpfulness of a review) and of not having to define precisely a hard-to-define skill, which is replaced by a model learned over the dataset. In our experiments, we modeled the quality of a review by a linear combination of metrics stemming from text analysis (like readability, polarity, spelling errors or length) and we used customer-declared helpfulness as a ground truth for constructing the model. This model achieves comparable to slightly better accuracy results when compared to a state-of-the-art approach. We used Bayesian Knowledge Tracing (KT), a popular model of skill acquisition, to measure the evolution of the ability to write reviews of good quality over a period of time. Our tests validated our hypothesis, showing that the model of skill acquisition achieves a very good to near perfect accuracy score.

Our short-term future work includes the revision of both the helpfulness model and the skill acquisition model. In particular, the helpfulness model can be extended with advanced features like sentiment analysis or reviewer profile features, while Deep Knowledge Tracing could be used instead of classical Knowledge Tracing. We also want to better understand the relation between the linear coefficients learned for the helpfulness model and the KT parameters of the corresponding sub-skills. Long-term goals include the generalization of our approach to other datasets and skills. We are particularly interested in better understanding in what contexts skill acquisition with model building is more relevant than only building the model.

1 https://github.com/Cubiccl/Continuous-Knowledge-Tracing/releases/tag/1.0

REFERENCES
[1] Arpita Agnihotri and Saurabh Bhattacharya. 2016. Online Review Helpfulness: Role of Qualitative Factors. Psychology & Marketing 33, 11 (Dec 2016), 1006–1017.
[2] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In LREC.
[3] Kathleen M. Cauley. 1986. Studying Knowledge Acquisition: Distinctions among Procedural, Conceptual and Logical Knowledge. In 67th Annual Meeting of the American Educational Research Association.
[4] Judith A Chevalier and Dina Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research 43, 3 (2006), 345–354.
[5] Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology 60, 2 (1975), 283.
[6] Albert T Corbett and John R Anderson. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4, 4 (1994), 253–278.
[7] Yossi Ben David, Avi Segal, and Ya'akov (Kobi) Gal. 2016. Sequencing educational content in classrooms using Bayesian knowledge tracing. In LAK. 354–363.
[8] Anindya Ghose and Panagiotis Ipeirotis. 2009. The EconoMining project at NYU: Studying the economic value of user-generated content on the internet. Journal of Revenue and Pricing Management 8, 2-3 (2009), 241–246.
[9] Yue Gong, Joseph E. Beck, and Neil T. Heffernan. 2010. Comparing Knowledge Tracing and Performance Factor Analysis by Using Multiple Model Fitting Procedures. In ITS. 35–44.
[10] Robert Gunning. 1952. The technique of clear writing. (1952).
[11] William J. Hawkins, Neil T. Heffernan, and Ryan Shaun Joazeiro de Baker. 2014.
Learning Bayesian Knowledge Tracing Parameters with a Knowledge Heuristic and Empirical Probabilities. In ITS. 150–155.
[12] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW. 507–517.
[13] Hong Hong and Di Xu. 2015. Research of online review helpfulness based on negative binary regress model: the mediator role of reader participation. In 2015 12th International Conference on Service Systems and Service Management (ICSSSM). 1–5.
[14] Jingxian Jiang, Ulrike Gretzel, and Rob Law. 2010. Do Negative Experiences Always Lead to Dissatisfaction? Testing Attribution Theory in the Context of Online Travel Reviews. In ENTER. 297–308.
[15] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Technical Report. Naval Technical Training Command Millington TN Research Branch.
[16] Nikolaos Korfiatis, Elena García-Bariocanal, and Salvador Sánchez-Alonso. 2012. Evaluating content quality and helpfulness of online product reviews: The interplay of review helpfulness vs. review content. Electronic Commerce Research and Applications 11, 3 (2012), 205–217.
[17] Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. 2008. Modeling and Predicting the Helpfulness of Online Reviews. In ICDM. 443–452.
[18] Julian John McAuley and Jure Leskovec. 2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW. 897–908.
[19] Susan M. Mudambi and David Schuff. 2010. What Makes a Helpful Online Review? A Study of Customer Reviews on Amazon.com. MIS Quarterly 34, 1 (2010), 185–200.
[20] Philip I. Pavlik, Hao Cen, and Kenneth R. Koedinger. 2009. Performance Factors Analysis - A New Alternative to Knowledge Tracing. In AIED. 531–538.
[21] Radek Pelánek. 2015. Metrics for Evaluation of Student Models. In EDM. 19.
[22] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. 2015. Deep Knowledge Tracing. In NIPS. 505–513.
[23] RJ Senter and Edgar A Smith. 1967. Automated readability index. Technical Report. Univ. Cincinnati.
[24] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In HLT-NAACL.
[25] Yutao Wang and Neil T. Heffernan. 2013. Extending Knowledge Tracing to Allow Partial Credit: Using Continuous versus Binary Nodes. In AIED. 181–188.
[26] Jianan Wu. 2017. Review popularity and review helpfulness: A model for user review effectiveness. Decision Support Systems 97 (2017), 92–103.
[27] Philip Fei Wu, Hans van der Heijden, and Nikolaos Korfiatis. 2011. The Influences of Negativity and Review Quality on the Helpfulness of Online Reviews. In ICIS.