Can models learned from a dataset reflect acquisition of procedural knowledge? An experiment with automatic measurement of online review quality

Martina Megasari, Nicolas Labroche, Pandu Wicaksono, Patrick Marcel, Chiao Yun Li, Verónika Peralta, Clément Chaussade, Shibo Cheng
University of Tours, France
firstname.lastname@univ-tours.fr, firstname.lastname@etu.univ-tours.fr

ABSTRACT

Can models learned from a dataset reflect how good humans are at mastering a particular skill? This paper studies this question in the context of online review writing, where the skill corresponds to the procedural knowledge needed to write helpful reviews. To this end, we model the quality of a review by a combination of various metrics stemming from text analysis (like readability, polarity, spelling errors or length) and we use customer-declared helpfulness as a ground truth for constructing the model. We use Knowledge Tracing, a popular model of skill acquisition, to measure the evolution of the ability to write reviews of good quality over a period of time. While recent studies have tried to measure the quality of a review and correlate it to helpfulness, to the best of our knowledge, our work is the first to address this question as the exercise of a reviewer's skill over a sequence of reviews. Our experiments on a set of 41,681 Amazon book reviews show that it is possible to accurately assess the individual skill acquisition of writing a helpful review, based on a statistical model of the procedural knowledge at hand rather than human evaluations prone to subjectivity and variations over time.

© 2018 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna, Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

1 INTRODUCTION

In today's era of big and open data, plenty of datasets are analyzed to derive models mimicking humans by using machine learning techniques. The representation and assessment of user knowledge opens new possibilities for big data analytics, such as differentiating between novice and expert users, taking advantage of user experience for recommending (e.g. products or actions), calculating advanced scores (e.g. credibility), assessing the quality of users' analyses, etc. In this paper we focus on the assessment of procedural knowledge from large data collections.

Procedural knowledge is the knowledge about how to do something. Different from declarative knowledge, which is often verbalized, the application of procedural knowledge may not be easily explained [3]. Models exist to evaluate procedural knowledge acquisition, like for instance the popular Bayesian Knowledge Tracing [6].

Many open datasets illustrate the application of procedural knowledge. For instance, Amazon review datasets like those provided by He and McAuley [12] contain customer-written reviews, where the skill of writing helpful reviews is an example of application of procedural knowledge. However, this skill is difficult to define and assess. Reviews can be voted helpful or not by customers, but this assessment is subjective and as such subject to variations over time, and it is difficult to construct a model that accurately predicts the helpfulness of a review [16].

In this paper, we show that it is possible to benefit from such very large datasets to learn an individual model of procedural knowledge acquisition. The resulting model of knowledge has several nice properties: (1) it is not prone to the usual bias caused by a single small set of evaluators that might be non-representative or produce a subjective evaluation, (2) it avoids defining explicitly the procedural knowledge at hand, which is instead replaced by a statistical model learned over the large dataset. As a consequence, the larger the dataset, the more accurate the modeling of the procedural knowledge, and the better the evaluation of the skill for a user.

To illustrate this, we experiment with a use case based on the aforementioned Amazon online product reviews. We chose this use case because it is prototypical of how procedural knowledge influences decision making. For instance, Mayzlin and Chevalier studied the effects of online book reviews on Amazon.com and Barnesandnoble.com and found a positive correlation between the reviews and the transactions of the book [4]. This means that reviewers' opinions play an important role in users' decisions on the transaction. Automatic measurement of the reviewer skill may be beneficial to predict how helpful the review is. A skillful writer is assumed to be able to write a good review, which can help the customer to make a better decision on the transaction.

To motivate our approach, suppose that we want to determine whether a reviewer can be assumed to master the skill of writing helpful reviews. This is preferable to trying to predict the helpfulness of the reviews, because of the high variability of reviewer profiles, reviews and votes received by reviews. However, this skill corresponds to procedural knowledge and is difficult to define. Therefore, to evaluate the skill of each reviewer, we use the classical Knowledge Tracing model. But instead of applying Knowledge Tracing directly over the votes received by reviews, we apply it over a model of helpfulness learned from each review. Our research question is: can this model of helpfulness be used to assess the skill accurately? Consider the four curves displayed in Figure 1. These curves are related to the evolution over time of the skill of writing helpful reviews for a particular reviewer (randomly extracted from the Amazon book review dataset). The helpfulness curve is the normalized score of helpfulness received by the 20 reviews written by this reviewer. The model curve is the helpfulness score as predicted for this reviewer by a model learned over the entire dataset. The KT helpfulness curve predicts the probability that this reviewer has acquired the skill of writing helpful reviews, computed with the helpfulness score. The KT model curve is the same probability computed with the model. On this example, it is obvious that even though the skill can be considered acquired, the helpfulness score is difficult to predict due to the subjectivity of the voters. On the other hand, a model of helpfulness can be learned to predict if the skill has been acquired.

Figure 1: Evolution of helpfulness for a reviewer and different models of it

The contributions of this paper are the following: (1) assuming that writing helpful reviews is a hard-to-define skill, we propose a model for it. We use low-level features of the online review, such as rating, spelling error ratio or readability score, to build the model that infers a high-level and human-related feature, which is helpfulness. This model is learned over the entire dataset and can be used to predict the helpfulness of future reviews for one particular reviewer. (2) Using Knowledge Tracing, we show that this model can be used to assess skill acquisition without relying on human-entered votes. In particular, we show that this model, although learned over the entire dataset, is accurate enough to predict if the skill is acquired by each individual reviewer. To the best of our knowledge, this work is the first to evaluate a reviewer's skill over a sequence of reviews with Knowledge Tracing.

The remainder of the paper is organized as follows. Section 2 discusses related works. Section 3 defines the features used to build the model of helpfulness. Section 4 details our approach. Section 5 explains how the experiment is performed to build the model and exposes the results. Finally, Section 6 concludes the paper and discusses some possible future work.

2 RELATED WORKS AND BACKGROUND

We first review recent works on online review evaluation and then describe the Bayesian Knowledge Tracing model and some of its extensions.

2.1 Online review evaluation

Readability tests play an important role in online review evaluation. Various indexes have been proposed to quantify the readability of an English text. Most of these indexes are related to the level of studies a person needs to understand the text at the first reading, according to the American standard. They are computed considering the number of words, number of sentences, number of syllables or number of characters as components. The Gunning-Fog Index (FOG) [10] aims to estimate the years of formal education a person needs to understand the text during the first reading. The Flesch Reading Ease (FK) [15] indicates the difficulty of a text using the number of words, number of sentences and number of syllables. Higher values indicate better readability. The Automated Readability Index (ARI) [23] measures the approximate representation of the US grade level needed to understand the text. The Coleman-Liau Index (CLI) [5] is an approximation of the US grade level needed to understand the text. More background on readability tests can be found in [16].

Previous works have studied the evaluation of online reviews due to the popularity of online marketing nowadays. Authors often pay attention to the influence of online reviews on helpfulness. Korfiatis et al. investigated the interplay between helpfulness, rating score and qualitative characteristics of the review text of 37,221 online reviews collected from Amazon UK during March to April 2008 [16]. The authors theorize that helpfulness relates to a model with three aspects: conformity (relation between the review text and the rating), understandability (readability of the review text) and expressiveness (length of the review text). The authors formulate several hypotheses and perform linear regression to validate the relationship between the metrics derived from reviews and the helpfulness of the reviews. Regarding understandability, four common readability scores, indicating the education level readers need in order to understand the content, are computed: FOG, FK, ARI and CLI. Their results indicate that the helpfulness of a review is directionally affected by its qualitative characteristics and in particular by review text readability. Precisely, the relationship between reviews with average length and their readability scores holds for both moderate and extreme reviews. In addition, readability has more impact on the length of the reviews. In their work, metrics related to polarity, the summary text of reviews and rating deviation (between the average rating and the reviewer's one) are not considered. Moreover, due to the purpose of the work, books having special offers are not considered, to avoid the price effect. In our work, such books are chosen due to the amount of reviews resulting from this price effect.

Based on 7,659 book reviews on Amazon UK, Wu et al. explored whether a negative bias exists when evaluating helpfulness [27]. The assumption was that negative reviews may be more helpful than positive ones. After applying a regression model controlling factors such as readability and length of the reviews, the results show that the assumption is not yet readily applicable to online reviews.

To the best of our knowledge, no work ever focused on the evolution of the quality of review text under the angle of skill acquisition, with a model learned only on the review content.
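To make the four readability indices discussed above concrete, the sketch below computes FOG, FK, ARI and CLI from simple text statistics. It is a minimal illustration: the formulas follow the variants given in Section 3 (equations (2)-(5)), and the function and parameter names are our own illustrative choices, not part of any library.

```python
def fog(nb_words, nb_sentences, nb_complex_words):
    """Gunning-Fog Index: estimated years of formal education (eq. (2))."""
    return 0.4 * ((nb_words / nb_sentences)
                  + 100 * (nb_complex_words / nb_words))

def fk(nb_words, nb_sentences, nb_syllables):
    """Flesch Reading Ease: higher values mean better readability (eq. (3))."""
    return (206.835 - 1.015 * (nb_words / nb_sentences)
            - 84.6 * (nb_syllables / nb_words))

def ari(nb_characters, nb_words, nb_sentences):
    """Automated Readability Index: approximate US grade level (eq. (4))."""
    return (4.71 * (nb_characters / nb_words)
            + 0.5 * (nb_words / nb_sentences) - 21.43)

def cli(nb_characters, nb_words, nb_sentences):
    """Coleman-Liau Index: approximate US grade level (eq. (5))."""
    return (5.89 * (nb_characters / nb_words)
            - 0.3 * (nb_sentences / nb_words) - 15.8)
```

For example, a 100-word, 5-sentence text with 130 syllables gives fk(100, 5, 130) of about 76.6, i.e. a fairly easy text, while the same counts with 10 complex words give fog(100, 5, 10) of exactly 12 years of education.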
Mudambi and Schuff analyzed 1,587 reviews from Amazon.com [19] to understand how review extremity, review depth and product type affect the perceived helpfulness of the review. Their helpfulness model is based on the features rating, review text word count, total votes and product type. Product type is either Experience goods or Search goods, where Experience goods are products that require sampling or purchase in order to evaluate product quality. Books are examples of experience goods. They found that for experience goods, moderate reviews are more helpful than extreme reviews (whether they are strongly positive or negative). In contrast, it has been observed that reviews closer to the general opinion of people (average rating score) may be considered more helpful by the potential buyers [14].

McAuley and Leskovec [18] propose a latent-factor model for recommending products that may be preferred by the users according to their experience level at the moment. The model evaluates the evolution of users' experiences and is based on the rating that users give to products. Unlike other works on temporal dynamics, which rely on the hypothesis that two users rating a product at the same time will provide the same rating, McAuley and Leskovec's model takes users' personal development into consideration in order to evaluate the expertise degree of the reviewers. Experiments showed for example that experts' ratings are easier to predict and are more similar to each other. While close to our work in the idea of taking the evolution of the user into account, this work focuses on ratings and not helpfulness, and therefore does not consider the linguistic aspect of review text.

Liu et al. considered a complex model learned using non-linear regression, that combines the reviewer's expertise (based on the number of similar reviews written in the past), the writing style of the review (characterized with part-of-speech tagging and counting the number of words in each tag), and the timeliness of the review [17]. They showed that the three factors accurately predict helpfulness, over a dataset of 22,819 reviews collected from IMDB.

In [26], review helpfulness is considered through five features including user profile aspects (age, verified purchase) together with rating, text length and the rank of the review in the webpage. A model learned on 12,756 reviews was shown to be reasonably robust.

Agnihotri and Bhattacharya explored how the helpfulness of online reviews is affected by content readability (FK Index), sentiment analysis and the number of reviews written by a reviewer [1]. It was observed on 1,608 Amazon reviews that the content readability and text sentiment of the reviews follow a curvilinear relationship with review helpfulness. Reviews whose readability score is very high or whose sentiment is very positive would be perceived as less helpful.

Hong and Xu analyze the impact of the review message and reviewer profile on the helpfulness of 2,997 online reviews collected from Douban.com [13]. Using negative binomial regression, the authors show that reader participation is positively related to online review helpfulness; reader participation fully mediates the effect of reviewer expertise history on online review helpfulness and partially mediates the effects of three other metrics: average rating, title depth and reviewer network centrality.

2.2 Knowledge Tracing Models

The Bayesian Knowledge Tracing model was proposed by Corbett and Anderson, using a Bayesian network to assess people's procedural knowledge acquisition or, simply put, "skill level" [6]. An individual's grasp of the procedural knowledge is expressed as a binary variable expressing whether the corresponding skill has been mastered or not. The knowledge of an individual cannot be directly observed, but it can be induced by observing the individual's answers to a series of questions (or opportunities to exercise the skill) in order to guess the probability distribution of knowledge mastering. Observation variables are also binary: the answer to the question is either correct or wrong.

Specifically, the Knowledge Tracing model has four parameters, namely two learning parameters, P(L0) and P(T), and two performance parameters, P(G) and P(S). P(L0) is the probability that the skill has been mastered before answering the questions. P(T) is the knowledge transformation probability: the probability that the skill will be learned at each opportunity to use the skill (i.e., the transition from not mastered to mastered). P(G) is the probability of guess: in the case of knowledge not mastered, the probability that the individual can still answer correctly. P(S) is the probability of slip, i.e. to fail while the skill is already mastered. The model uses these parameters to calculate the learning probability after each question to monitor the individual's knowledge status and predict their future learning probability of knowledge acquisition using a Bayesian network.

The probability that a skill L at question i+1 is mastered, denoted P(Li+1), is the sum of two probabilities: (1) the posterior probability that the skill was already learned, contingent on the evidence at time i, i.e. the i-th opportunity to evaluate the skill, which can either be Correct or Incorrect, and (2) the probability that the knowledge changes from not mastered to mastered at the i-th opportunity. This is captured by the following formula:

    P(Li+1) = P(Li | Evidencei) + (1 − P(Li | Evidencei)) ∗ P(T)    (1)

where:

    P(Li | Evidencei = Correct) = P(Li) ∗ P(¬S) / (P(Li) ∗ P(¬S) + P(¬Li) ∗ P(G))

    P(Li | Evidencei = Incorrect) = P(Li) ∗ P(S) / (P(Li) ∗ P(S) + P(¬Li) ∗ P(¬G))

Due to its predictive accuracy, Corbett and Anderson's Bayesian Knowledge Tracing is one of the most popular models. However, several challenges, including local minima, degenerate parameters and computational costs during fitting, still exist. Hawkins et al. proposed a fitting method avoiding these problems while achieving a similar predictive accuracy, and evaluated it against one of the most popular fitting methods: Expectation-Maximization [11]. In this extension, the parameters are fitted by estimating the most likely opportunity at which each individual learned the skill. The learner's performance is thus annotated with an estimate of when the skill is learned, assuming that a known state can never be followed by an unknown state. This annotation is used to construct knowledge sequences that, when compared with the actual performance sequence, allow to empirically derive the model's four parameters.

As aforementioned, traditionally, the performance of an individual is represented as a binary value, correct or wrong, which does not account for all skill-learning situations. Wang et al. proposed to extend the Knowledge Tracing model by replacing the discrete binary performance node with a continuous partial credit node [25]. In this extension, it is assumed that P(G) and P(S) follow two Gaussian distributions, described respectively by their means and standard deviations. The prediction of the performance node also follows a Gaussian distribution, in which the mean value is used for the prediction. Noticeably, the standard deviation contains the information of how good the prediction is. Experiments with this extension show that by relaxing the assumption of binary correctness, the predictions of an individual's performance can be improved.

These two improvements of the Knowledge Tracing model (in the fitting method and the use of partial credits) were used successfully in sequencing educational content to students [7]. We conclude this section by noting that other models exist for predicting a learner's skill. Specifically, Performance Factor Analysis [20] uses standard logistic regression with the student performance as dependent variable. Interestingly, it is shown in [9] that Knowledge Tracing can achieve comparable predictive accuracy to Performance Factor Analysis. Finally, Deep Knowledge Tracing [22] uses Recurrent Neural Networks to model student learning, with the advantage of not having to set explicit probabilities for slip and guess. However, these models need very large datasets to learn the latent state from sequences, and most importantly, the encoding of the input vectors depends on an upper bound on the number of exercises, which does not directly fit our context.

3 FEATURES AND METRICS

Consistently with the previous work of Korfiatis et al. [16], our model of helpfulness is based on features that are grouped in three categories: Conformity, Understandability and Extensiveness, with additional features compared to [16]. We derive metrics, i.e., numerical attributes to be used in the definition of our model, from these features. Conformity expresses the consistency of a review being written. In addition to the classical rating, we add two metrics in this category: Polarity and Deviation. Understandability measures how good the quality of the written text is in terms of readability. We derived five metrics to measure the score: Spelling Error Ratio and 4 readability metrics (FOG, FK, ARI, and CLI). Finally, Extensiveness refers to the length of the review. In total, 16 metrics are defined, since length and readability metrics apply both to the review text and the summary. We detail them below; a summary of the features used in the experiments with their name, category, theoretical and empirical range is provided in Tables 1 and 2.

    Feature name        Category       Applies to  Range
    rating              Conformity     all         [1, 5]
    polarityReviewText  Conformity     text        [-1, 1]
    polaritySummary     Conformity     summary     [-1, 1]
    deviation           Conformity     all         [0, 5]
    reviewTextSER       Readability    text        [0, 1]
    summarySER          Readability    summary     [0, 1]
    reviewTextFOG       Readability    text        R+
    summaryFOG          Readability    summary     R+
    reviewTextFK        Readability    text        R
    summaryFK           Readability    summary     R
    reviewTextARI       Readability    text        R
    summaryARI          Readability    summary     R
    reviewTextCLI       Readability    text        R
    summaryCLI          Readability    summary     R
    reviewTextLength    Extensiveness  text        N+
    summaryLength       Extensiveness  summary     N+

Table 1: Summary of the main features

    Feature name        Min        Max      Mean      Std Dev.
    rating              1          5        4.112     1.183
    polarityReviewText  -0.875     0.875    0.027     0.052
    polaritySummary     -0.875     1        0.029     0.137
    deviation           0          3.786    0.452     0.615
    reviewTextSER       0          0.5      0.009     0.008
    summarySER          0          1        0.014     0.038
    reviewTextFOG       0          740.8    13.983    8.45
    summaryFOG          0          42.4     9.524     10.038
    reviewTextFK        -1788.235  121.22   58.96     24.407
    summaryFK           -1824.58   121.728  59.537    51.228
    reviewTextARI       -6.837     919.088  11.41     10.374
    summaryARI          -16.22     261.67   5.162     7.769
    reviewTextCLI       -22.24     39.133   8.64      2.549
    summaryCLI          -58.13     307.6    5.387     9.417
    reviewTextLength    0          32669    1152.094  1261.787
    summaryLength       1          257      28.875    16.786

Table 2: Empirical values of the metrics

3.1 Conformity

Metrics in this category relate to the consistency of the review. As the content of a review consists in a rating and a written text, we can derive a relation between them. A rating should correspond to the written review and vice versa; hence a difference between these two contents might indicate that the review is inconsistent. For example, a review having a 5-star rating and very negatively written text is inconsistent. Needless to say, inconsistent reviews may lead to a lower helpfulness score due to the confusion they bring. From this perspective, we consider the Polarity of the text, which indicates the positiveness or negativeness of a review, as a metric. Besides, the extremity of the rating given by the reviewer may indicate that the reviewer is biased and has a subjective point of view on the product being reviewed. Extremely high and low ratings are associated with lower levels of helpfulness than moderate ratings [19]. In contrast, reviews closer to the general opinion of people (average rating score) may be considered more helpful by the potential buyers [14]. From this perspective, we derived the Deviation score, quantifying how different the rating given by the reviewer is from the average rating.

Rating. The Rating of a review is the user-input quantitative indicator of the quality of the item reviewed (e.g., rating is from 1 to 5 for Amazon Book Reviews).

Polarity. The Polarity of a text is measured by using a word list that indicates the positivity, negativity and objectivity of each synset. The polarity score of a word with its part of speech is calculated as the score of positivity minus the score of negativity. The value of polarity ranges between -1 and 1: -1 indicates that the written text is very negative and 1 indicates that the written text is very positive.

Deviation. Deviation is calculated as the absolute difference between the rating of a review and the average rating of the item reviewed.

3.2 Readability

Metrics in this category relate to the effort needed to understand the text of the review. This is measured based on the number of spelling errors in the written text, which is expected to be negatively correlated to helpfulness [8], and with various readability measures.

Spelling Error Ratio (SER). The Spelling Error Ratio is the number of spelling errors divided by the text length.

Gunning-Fog Index (FOG). The FOG [10] aims to estimate the years of formal education (according to the American system) a person needs to understand the text during the first reading. This index uses the number of words, the number of sentences and the number of complex words to measure the years. A word is considered a complex word if it has more than two syllables.

    FOG = 0.4 ∗ [(nbWords / nbSentences) + 100 ∗ (nbComplexWords / nbWords)]    (2)

Flesch Reading Ease (FK). The FK index [15] indicates the difficulty of a text using the number of words, number of sentences and number of syllables.

    FK = 206.835 − 1.015 ∗ (nbWords / nbSentences) − 84.6 ∗ (nbSyllables / nbWords)    (3)

Automated Readability Index (ARI). The ARI [23] approximates the US grade level needed to understand the text. This index uses the number of characters, number of words and number of sentences.

    ARI = 4.71 ∗ (nbCharacters / nbWords) + 0.5 ∗ (nbWords / nbSentences) − 21.43    (4)

Coleman-Liau Index (CLI). The CLI [5], like ARI, is an approximation of the US grade level needed to understand the text. This index also uses the number of characters, number of words and number of sentences as components.

    CLI = 5.89 ∗ (nbCharacters / nbWords) − 0.3 ∗ (nbSentences / nbWords) − 15.8    (5)

3.3 Extensiveness

The textual part of the review consists of a text and a summary of this text. For both we measure the length in characters, respectively called Review Text Length and Summary Length.

4 METHODOLOGY

Our approach is divided into three phases: metric extraction, model construction and skill evaluation. These phases are detailed below.

4.1 Metric extraction and feature selection

In the first phase, we calculate for each review the scores for the metrics presented in Section 3, which we use to build the model of helpfulness. Then we apply feature selection to reduce the set of metrics by removing redundant ones, while avoiding losing too much information on the dataset. We use a heuristic greedy method by calculating all the pairwise correlations between metrics. For those metrics that are highly correlated, only the ones highly correlated with the helpfulness score will be kept, the others being discarded. Finally, we normalize the scores in order to be independent of attribute ranges and units and highlight the actual importance of each attribute. We use the Min-Max Scaling normalization strategy.

4.2 Model construction

We build our model to measure the quality of a review, where quality is defined by the helpfulness ratio of the review:

    helpfulness = nbHelpfulVotes / nbVotes    (6)

where nbHelpfulVotes is the number of positive votes received by the review and nbVotes is the total number of votes received by the review. This constitutes the class attribute value of a supervised machine learning method to build our simple model of helpfulness as a linear combination of the metrics. Thus, our predicted output variable y ∈ R will be expressed as a weighted sum of input features xi, ∀i ∈ [1, m], m being the number of features:

    y = Σ_{i=1..m} ωi ∗ xi + b    (7)

where ωi ∈ R is the weight reflecting the contribution of feature i to the overall decision and b ∈ R stands for the bias.

The intuition behind restricting our study to linear models is twofold. First, these models are simpler and can be calculated more efficiently. Second, they allow for a direct interpretation of the contribution of each feature to the final helpfulness decision. To this end, we try a variety of methods and keep the one best fitting the dataset. In our tests, error measurement is done using the classical correlation coefficient, Efron's R2, MAE and RMSE scores.

4.3 Skill evaluation

In this last phase, we apply Knowledge Tracing (KT) to sequences of reviews in order to estimate reviewers' skills. We proceed as follows: we group the reviews by reviewer, obtaining one sequence of reviews per reviewer. Each review is considered as an opportunity to learn the skill (i.e. being able to write useful reviews) and is graded with a score representing the reviewer's performance (i.e. how useful the review is). We compute two KT scores: (i) directly from helpfulness ratings, and (ii) from the learned helpfulness model. In the former, the reviewer's performance is calculated as the helpfulness score of the review. In the latter, it is predicted by the helpfulness model. In both cases, the final score, output by the KT model, expresses the probability that the skill is mastered by the reviewer.

We use the continuous version of KT described in [25] since the scores we consider are continuous. In this extension of KT, P(G) and P(S) are assumed to follow a Gaussian distribution, and as such, they are represented by a mean value and a standard deviation. As a consequence, and as opposed to binary KT, the prediction P(Ln) also follows a Gaussian distribution, whose mean expresses the value of the prediction and whose standard deviation expresses the confidence attached to this prediction. To learn the 6 parameters of continuous KT, we extend the approach proposed by Hawkins et al. [11] so that it outputs estimates of P(G) and P(S) described by a mean and a standard deviation. Then, based on these 6 parameters, the estimation of each skill acquisition P(Ln) is performed by running 100 tests, with randomly generated values for P(G) and P(S) following their respective distributions. From these 100 P(Ln) estimates, we compute a mean and a standard deviation following the normal hypothesis.
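As an illustration of the Knowledge Tracing update of Section 2.2, the sketch below implements one binary KT step, i.e. equation (1) together with the two evidence cases. It is a minimal illustration with our own function and parameter names; the continuous, Gaussian extension of [25] used in this paper would replace the binary observation with a partial-credit score.

```python
def kt_update(p_l, observation_correct, p_t, p_g, p_s):
    """One binary Knowledge Tracing step (Corbett & Anderson).

    p_l: current probability P(Li) that the skill is mastered.
    p_t, p_g, p_s: transition, guess and slip probabilities.
    Returns P(Li+1) after observing one opportunity.
    """
    if observation_correct:
        # P(Li | Evidencei = Correct)
        posterior = (p_l * (1 - p_s)) / (p_l * (1 - p_s) + (1 - p_l) * p_g)
    else:
        # P(Li | Evidencei = Incorrect)
        posterior = (p_l * p_s) / (p_l * p_s + (1 - p_l) * (1 - p_g))
    # Equation (1): learning transition with probability P(T)
    return posterior + (1 - posterior) * p_t

def trace(observations, p_l0, p_t, p_g, p_s):
    """Run KT over a sequence of correct/incorrect observations."""
    p_l, history = p_l0, []
    for obs in observations:
        p_l = kt_update(p_l, obs, p_t, p_g, p_s)
        history.append(p_l)
    return history
```

For instance, with P(L0)=0.3, P(T)=0.2, P(G)=0.2 and P(S)=0.1, a single correct answer raises the mastery estimate from 0.30 to about 0.73.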
However, the KT efficiency is known to depend on the granularity of the skills that are fed to the model: generally, the more focused the skills, the better the prediction of skill acquisition. In this respect, each of the features that feed our linear predictive model of helpfulness can be considered as a sub-skill related to helpfulness. For this reason, we define two distinct tests to evaluate the learned model of helpfulness. In the first, we simply use the output of the linear regression model as the predicted helpfulness for a review. In the second, we consider each feature metric as a possible sub-skill evaluation of the reviewer. We then learn as many KT models as there are features. In the end, we have the probabilities that the sub-skills corresponding to each feature are acquired. These sub-skill scores are then aggregated into one single skill acquisition probability.

The global validation of our proposal is given by measuring the error between the KT based on real ratings, the KT based on the general linear model, and the KT based on aggregated feature-based models. This error is evaluated by RMSE, which has been shown to be the strongest performance indicator for binary KT, with significantly higher correlation than Log Likelihood and Area Under Curve [21].

5 EXPERIMENTS

Our implementation is done in Java 8, with Weka 3.8 for model learning. We used our own implementation of knowledge tracing, whose code has been made available through GitHub1 as one contribution of this paper. For polarity extraction, we use SentiWordNet [2], which lists the positivity, negativity and objectivity of each synset (set of synonyms). SentiWordNet provides the score of each word together with its part of speech, hence we POS-tag each word using the Stanford POS tagging library [24].

5.1 Dataset description

The dataset we use for experiments is the Amazon Book Review Data provided by Julian McAuley from UCSD [12]. We select the book category in this dataset, resulting in 22,507,155 total reviews. As one of our goals is to measure the evolution of the ability to write reviews of good quality, we need to obtain for each reviewer a sequence of reviews long enough to observe that evolution. Therefore, we consider reviewers with fewer than 30 reviews as not active enough and filter them out. In addition, we only consider the reviews that have been scored by customers by means of votes (helpful review or not).

To confirm the hypothesis that few reviewers have written many reviews and that many reviewers have written few reviews, we plotted on Figure 2 the number of reviewers (on a logarithmic scale) by number of reviews, for reviewers with more than 30 reviews. Each point (x, y) in this figure indicates that x reviewers have written y reviews. Furthermore, we found reviewers writing so many reviews that they appear dubious and possibly bias their reviews. For instance, the reviewer with ID A14OJS0VWMOSWO wrote 43,201 reviews with an average score of 4.9991 out of 5. This reviewer received 240,262 votes, of which 199,573 are helpful. In our opinion, such reviewers introduce a bias in the dataset. Hence we limited our experiment to reviewers that have 30 to 50 reviews.

We calculate the score of each feature from the dataset and report their standard deviations in the last column of Table 2. The standard deviation of the helpfulness, which varies in [0,1], is 0.32, which indicates that the score is quite spread out and that the dataset has a wide enough variety, from helpful reviews to not helpful reviews. Moreover, the standard deviations of the features indicate that creating a model from this dataset is difficult.

5.2 Model construction

We now describe how the model of helpfulness is learned from the dataset. Consistently with [16], our model of helpfulness is constructed as a linear combination of the metrics extracted from the review text and summary. More precisely, as explained in Section 4, we use a linear classifier to learn a weight for each of the features introduced in the previous section, in order to understand its contribution to the helpfulness score. We tested three different approaches to learn the features' weights: Linear Regression, Perceptron and Support Vector Machine with a linear kernel. We used off-the-shelf Weka algorithms with 10-fold cross validation. Table 5 summarizes the results of those tests, for various dataset sizes selected according to the minimum number of votes per review (from 918 reviews with at least 200 votes, up to 522,804 reviews with at least 1 vote). Results for Perceptron and SVM are not reported for the largest dataset due to excessive computation time. The results show that linear regression achieves a good compromise between accuracy and computation time, with better accuracy on smaller datasets and better handling of larger datasets with no significant drop in accuracy. We therefore chose to work with linear regression in what follows.

5.2.1 Preprocessing. We recall that our definition of helpfulness is the number of helpful votes divided by the total number of votes; hence, a review with a large number of votes is a genuine representation of helpfulness from a customer's point of view. But a review with only one vote, being a helpful one, can still obtain a maximum helpfulness score, which is not desirable. Filtering the dataset by number of votes thus becomes necessary. In order to find the appropriate minimum number of votes per review, we iterated this parameter from 1 to 25 for the most important features of our model (i.e., after feature selection), and checked the results in terms of correlation and expressiveness (contribution of each metric), reported in Table 3. We decided to choose 2 datasets among those tested, based on, first, expressiveness (determined by non-zero coefficient values in the linear model), and second, correlation coefficient (which indicates to what extent the model matches the dataset), for more than 10,000 reviews. The best expressiveness and correlation coefficients were obtained for at least 12 votes and at least 23 votes, respectively. At this stage, we are not sure about the effect of these parameters on the knowledge tracing model; we therefore keep both data sets, to see which gives a better result in knowledge tracing. In what follows, the first dataset is called minVotes = 12 and consists of 41,681 reviews, while the second dataset is called minVotes = 23 and consists of 11,083 reviews.

Using linear regression on the two datasets minVotes = 12 and minVotes = 23 results in the models described in Tables 6 and 7 respectively. The models constructed are evaluated with correlation coefficient, Efron's R², MAE and RMSE scores, reported in Table 8.

5.2.2 Feature selection impact. We then proceed to feature selection, as described in Section 4.1. As shown in Table 8, our models before and after feature selection achieve very similar accuracy results.
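The greedy, correlation-based selection heuristic of Section 4.1 can be sketched in a few lines of Python (the paper's actual pipeline is Java/Weka). The representation of features as aligned numeric lists and the redundancy threshold of 0.9 are illustrative assumptions of this sketch, not values given in the paper:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def select_features(features, helpfulness, threshold=0.9):
    """Greedy selection: among each pair of mutually redundant features
    (|correlation| above threshold), drop the one that is less correlated
    with the helpfulness target; keep everything else."""
    kept = set(features)
    names = sorted(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(features[a], features[b])) > threshold:
                r_a = abs(pearson(features[a], helpfulness))
                r_b = abs(pearson(features[b], helpfulness))
                kept.discard(a if r_a < r_b else b)
    return sorted(kept)
```

On a toy input with a duplicated length feature, exactly one copy of the redundant pair survives while uncorrelated features are untouched, which is the behavior the heuristic above describes.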
If efficiency in learning the model is an issue, or if the model should remain as simple as possible, one can then safely decide to use the model learned on only the selected features. In what follows, we report the results for both sets of features.

A second lesson learned from our feature selection step is that, interestingly, for both datasets, the selected features include features that were not present in [16], namely the spelling error ratio, polarity and deviation. With the notable exception of the summary spelling error ratio, these features' weights remain steady, and in some cases relatively important, after feature selection. Quite surprisingly, reviewTextSER has no impact on helpfulness, while, as expected, deviation contributes highly negatively to it.

Figure 2: Number of reviews by number of reviewers

minVotes | Number of reviewers | Number of reviews | Correlation coefficient | Number of zero coefficients
1 | 13820 | 522801 | 0.3352 | 0
2 | 13556 | 350158 | 0.3917 | 0
3 | 11312 | 247394 | 0.4351 | 0
4 | 9060 | 184092 | 0.4649 | 0
5 | 7408 | 142572 | 0.4946 | 0
6 | 6304 | 115295 | 0.5174 | 0
7 | 5349 | 94482 | 0.5373 | 0
8 | 4586 | 78416 | 0.5544 | 0
9 | 3972 | 66216 | 0.5716 | 0
10 | 3453 | 56446 | 0.5834 | 0
11 | 3004 | 48278 | 0.5975 | 1
12 | 2643 | 41681 | 0.6065 | 0
13 | 2320 | 36058 | 0.6154 | 2
14 | 2041 | 31385 | 0.6245 | 2
15 | 1822 | 27596 | 0.6355 | 2
16 | 1616 | 24270 | 0.6415 | 2
17 | 1451 | 21519 | 0.6465 | 2
18 | 1302 | 19131 | 0.6551 | 2
19 | 1192 | 17274 | 0.657 | 2
20 | 1079 | 15489 | 0.6625 | 2
21 | 980 | 13899 | 0.6698 | 2
22 | 886 | 12451 | 0.6772 | 2
23 | 793 | 11083 | 0.6804 | 2
24 | 714 | 9940 | 0.684 | 2
25 | 655 | 9069 | 0.6897 | 2
Table 3: Correlation Coefficient for various minVotes

Metric | minVotes = 12 | minVotes = 23
rating | 0.31117594 | 0.37121056
polarityReviewText | 0.36708846 | 0.27873667
polaritySummary | 0.05166703 | 0.08006764
deviation | -0.20847153 | -0.1951008
reviewTextSER | 0 | 0
summarySER | -0.28603436 | -0.25242002
reviewTextFOG | -1.10263702 | -0.40678142
summaryFOG | 0 | 0.02506183
reviewTextFK | 4.37638627 | 2.3020136
summaryFK | 0.12251469 | 0.13780373
reviewTextARI | 5.01873535 | 2.47411051
summaryARI | -0.4099729 | -0.10677126
reviewTextCLI | 0.31215745 | 0.21371702
summaryCLI | 0.79694206 | 0.28470061
reviewTextLength | 0.30807426 | 0.3431656
summaryLength | 0 | 0.03837902
bias | -4.26391009 | -2.21159487
Table 4: Coefficients of the Linear Regression Model for minVotes = 12 and minVotes = 23

Algorithm | Dataset size | Exec. time | Correlation coefficient | RMSE score
Linear Regression | 918 | 0.01 | 0.6455 | 0.2005
Perceptron | 918 | 0.12 | 0.5071 | 0.2635
SVM | 918 | 0.25 | 0.6352 | 0.218
Linear Regression | 3414 | 0.02 | 0.7226 | 0.1957
Perceptron | 3414 | 0.44 | 0.5135 | 0.2569
SVM | 3414 | 6.39 | 0.7199 | 0.1992
Linear Regression | 10971 | 0.02 | 0.6888 | 0.2023
Perceptron | 10971 | 1.39 | 0.5349 | 0.2658
SVM | 10971 | 101.87 | 0.6846 | 0.2062
Linear Regression | 29808 | 0.04 | 0.6303 | 0.2064
Perceptron | 29808 | 3.81 | 0.499 | 0.2401
SVM | 29808 | 829.65 | 0.627 | 0.2119
Linear Regression | 522801 | 0.67 | 0.3352 | 0.3028
Table 5: Test of 3 linear model algorithms on various datasets

5.2.3 Comparison with the state-of-the-art. As to model accuracy, Table 8 shows that the results we obtained are notably comparable, and in some cases slightly better, than those reported in [16] on datasets of similar size (37,221 Amazon UK reviews were analyzed in that work). In that work, 3 models were constructed, and their fitness to the dataset was reported in terms of Efron's R² scores. Their three models obtained respectively 0.316, 0.354 and 0.451, while ours scores 0.3697 for minVotes = 12 and 0.4651 for minVotes = 23 (the higher the better for Efron's R²). Importantly, their models incorporate the features number of votes and number of helpful votes, which we have deliberately not included in ours, since we aim at predicting helpfulness when no such scores are available.

Finally, the two datasets minVotes = 12 and minVotes = 23 achieve comparable MAE and RMSE, even though minVotes = 23 shows a better correlation coefficient and Efron's R². This illustrates the robustness of our model construction approach to larger but more skewed datasets.

Metric | Before | After
rating | 0.31117594 | 0.31312877
polarityReviewText | 0.36708846 | 0.3655654
polaritySummary | 0.05166703 | 0.05351795
deviation | -0.20847153 | -0.20951913
reviewTextSER | 0 | -0.03361242
summarySER | -0.28603436 | -0.31027976
reviewTextFOG | -1.10263702 | N.A.
summaryFOG | 0 | -0.04014441
reviewTextFK | 4.37638627 | 0.4228708
summaryFK | 0.12251469 | N.A.
reviewTextARI | 5.01873535 | N.A.
summaryARI | -0.4099729 | N.A.
reviewTextCLI | 0.31215745 | 0.04970302
summaryCLI | 0.79694206 | 0.40990694
reviewTextLength | 0.30807426 | 0.3077809
summaryLength | 0 | 0.03922442
bias | -4.26391009 | -0.12802418
Table 6: Models of helpfulness before and after feature selection for minVotes = 12

Metric | Before | After
rating | 0.37121056 | 0.37369313
polarityReviewText | 0.27873667 | 0.28253483
polaritySummary | 0.08006764 | 0.0821465
deviation | -0.1951008 | -0.19656865
reviewTextSER | 0 | 0
summarySER | -0.25242002 | -0.29930955
reviewTextFOG | -0.40678142 | N.A.
summaryFOG | 0.02506183 | -0.021072
reviewTextFK | 2.3020136 | 0.13767929
summaryFK | 0.13780373 | N.A.
reviewTextARI | 2.47411051 | N.A.
summaryARI | -0.10677126 | N.A.
reviewTextCLI | 0.21371702 | 0
summaryCLI | 0.28470061 | 0.13525824
reviewTextLength | 0.3431656 | 0.34368253
summaryLength | 0.03837902 | 0.07231388
bias | -2.21159487 | 0.13145667
Table 7: Models of helpfulness before and after feature selection for minVotes = 23

Evaluation Metrics | minVotes = 12 | minVotes = 23
Total Number of Reviews | 41681 | 11083
|Before feature selection|
Correlation Coefficient | 0.608 | 0.682
Efron's R² | 0.3697 | 0.4651
Mean Absolute Error | 0.1521 | 0.1494
Root Mean Squared Error | 0.2014 | 0.201
|After feature selection|
Correlation Coefficient | 0.6065 | 0.6804
Efron's R² | 0.3678 | 0.4629
Mean Absolute Error | 0.1526 | 0.15
Root Mean Squared Error | 0.2017 | 0.2014
Table 8: Evaluation of the models

5.3 Skill evaluation

In this section, we show that the model obtained can be used to accurately predict the learning of the skill of writing helpful reviews.

After training the Knowledge Tracing (KT) model as explained in Section 4.3 using 10-fold cross validation, we obtain the average of the six parameters and the KT model RMSE scores. We also learn one KT model per sub-skill and aggregate them to obtain a single probability, as explained in Section 4.3. To be consistent with the learning of the linear regression model, this aggregation is done with the weights learned for that model. The results are reported in Tables 9 and 10. Each table shows the average skill acquisition probability (mean(Ln)) for the actual helpfulness skill, the helpfulness model and the aggregation of the sub-skills. We also report the parameters learned for the KT of the model.

For the sake of readability, we recall that RMSE scores are generated in three ways:
• RMSE, as reported in Table 8, represents the error between the helpfulness model scores and the actual helpfulness scores, without KT involved at that point.
• actual-model Knowledge RMSE (a-mKRMSE) represents the error between the KT of the actual helpfulness scores and the KT of the helpfulness as computed with the model.
• actual-Aggregated Knowledge RMSE (a-AggKRMSE) represents the error between the KT of the actual helpfulness scores and the aggregation of the KT scores of each feature taken independently (i.e., each sub-skill).

Scores | minVotes = 12 | minVotes = 23
Actual skill: mean(Ln) | 0.968337 | 0.960511
Actual skill: variation(Ln) | 0.025213 | 0.033238
Model: P(L0) | 0.007504 | 0.033457
Model: P(T) | 0.030262 | 0.077669
Model: mean(P(G)) | 0.349992 | 0.369982
Model: variation(P(G)) | 0.007067 | 0.0147
Model: mean(P(S)) | 0.412574 | 0.412882
Model: variation(P(S)) | 0.016212 | 0.025905
Model: mean(Ln) | 0.783885 | 0.800687
Model: variation(Ln) | 0.090915 | 0.090820
Aggregated: mean(Ln) | 0.999943 | 0.999991
Aggregated: variation(Ln) | 0.000584 | 0.000122
a-mKRMSE | 0.164619 | 0.156373
a-AggKRMSE | 0.064818 | 0.081964
Table 9: KT parameters, prediction and predictive accuracy before feature selection

Scores | minVotes = 12 | minVotes = 23
Actual skill: mean(Ln) | 0.968026 | 0.961189
Actual skill: variation(Ln) | 0.02536 | 0.032609
Model: P(L0) | 0.004666 | 0.030936
Model: P(T) | 0.026656 | 0.075609
Model: mean(P(G)) | 0.348688 | 0.369544
Model: variation(P(G)) | 0.007853 | 0.015195
Model: mean(P(S)) | 0.414712 | 0.406663
Model: variation(P(S)) | 0.014588 | 0.027189
Model: mean(Ln) | 0.774606 | 0.804482
Model: variation(Ln) | 0.090247 | 0.093722
Aggregated: mean(Ln) | 0.96117 | 0.900541
Aggregated: variation(Ln) | 0.048756 | 0.055234
a-mKRMSE | 0.170865 | 0.156404
a-AggKRMSE | 0.062204 | 0.050704
Table 10: KT parameters, prediction and predictive accuracy after feature selection

Before commenting on the results of the tests, it is important to note that the average value of the helpfulness skill acquisition probability (i.e., the value to be predicted) is high. We conjecture that this is due to the importance of the filtering, in terms of number of reviews per reviewer and number of votes, applied over the dataset.

5.3.1 Accuracy of the two KT models. The key observation is that switching to KT achieves very good to excellent RMSE scores, whatever the dataset considered. Notably, predicting the skill of writing helpful reviews is done much more accurately than predicting helpfulness. This allows us to answer positively the question expressed at the beginning of this paper: a model constructed on a large dataset can be used to assess procedural knowledge acquisition. Interestingly, predicting each sub-skill (corresponding to each feature) and combining these predictions to infer the global skill of writing helpful reviews is significantly better than predicting the skill at the coarse level of the model. In our tests, this combination was naively done with the weights learned by the linear regression algorithm (normalized, bias included) to build the model of helpfulness. Determining more sophisticated weight combinations is left for future work.

5.3.2 Comparison with random sequences of helpfulness scores. The small RMSE indicates that the KT model is good at predicting the learning of the writing skill of the reviewers. However, in order to validate the hypothesis that these good results do not come from an intrinsic smoothing behavior of the KT model, we ran the model on random sequences of helpfulness scores. To this end, we generated as many sequences as the original data set contains, and faked the helpfulness scores with random numbers generated between 0 and 1. The results, reported in Table 11, confirm that for both datasets the RMSE values are bad: for a random sequence of numbers as helpfulness scores, the model fails to predict the skill of the reviewers (which in this case is expectedly close to 0.5).

Scores | minVotes = 12 | minVotes = 23
P(L0) | 0 | 0
P(T) | 0.014637 | 0.016329
mean(P(G)) | 0.346532 | 0.345426
variation(P(G)) | 0.007084 | 0.014245
mean(P(S)) | 0.409464 | 0.409721
variation(P(S)) | 0.016246 | 0.029494
mean(Ln) | 0.518769 | 0.518702
variation(Ln) | 0.110546 | 0.106100
a-mKRMSE | 0.673532 | 0.669856
Table 11: KT parameters, prediction and predictive accuracy for random sequences of helpfulness

6 CONCLUSION

In this paper, we experimented with a large dataset of Amazon book reviews to show that a model of review helpfulness can be used to assess the acquisition of the skill of writing helpful reviews. Learning such an individual model of procedural knowledge acquisition has the advantages of being less prone to human variation and subjectivity (e.g., in judging the helpfulness of a review) and of not having to define precisely a hard-to-define skill, which is replaced by a model learned over the dataset. In our experiments, we modeled the quality of a review by a linear combination of metrics stemming from text analysis (like readability, polarity, spelling errors or length) and we used customer-declared helpfulness as a ground truth for constructing the model. This model achieves comparable to slightly better accuracy results when compared to a state-of-the-art approach. We used Bayesian Knowledge Tracing (KT), a popular model of skill acquisition, to measure the evolution of the ability to write reviews of good quality over a period of time. Our tests validated our hypothesis, showing that the model of skill acquisition achieves a very good to near perfect accuracy score.

Our short-term future work includes the revision of both the helpfulness model and the skill acquisition model. In particular, the helpfulness model can be extended with advanced features like sentiment analysis or reviewer profile features, while Deep Knowledge Tracing could be used instead of classical Knowledge Tracing. We also want to better understand the relation between the linear coefficients learned for the helpfulness model and the KT parameters of the corresponding sub-skills. Long-term goals include the generalization of our approach to other datasets and skills. We are particularly interested in better understanding in what contexts skill acquisition with model building is more relevant than only building the model.

1 https://github.com/Cubiccl/Continuous-Knowledge-Tracing/releases/tag/1.0

REFERENCES
[1] Arpita Agnihotri and Saurabh Bhattacharya. 2016. Online Review Helpfulness: Role of Qualitative Factors. Psychology & Marketing 33, 11 (Dec 2016), 1006–1017.
[2] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In LREC.
[3] Kathleen M. Cauley. 1986. Studying Knowledge Acquisition: Distinctions among Procedural, Conceptual and Logical Knowledge. In 67th Annual Meeting of the American Educational Research Association.
[4] Judith A Chevalier and Dina Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research 43, 3 (2006), 345–354.
[5] Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology 60, 2 (1975), 283.
[6] Albert T Corbett and John R Anderson. 1994. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4, 4 (1994), 253–278.
[7] Yossi Ben David, Avi Segal, and Ya'akov (Kobi) Gal. 2016. Sequencing educational content in classrooms using Bayesian knowledge tracing. In LAK. 354–363.
[8] Anindya Ghose and Panagiotis Ipeirotis. 2009. The EconoMining project at NYU: Studying the economic value of user-generated content on the internet. Journal of Revenue and Pricing Management 8, 2-3 (2009), 241–246.
[9] Yue Gong, Joseph E. Beck, and Neil T. Heffernan. 2010. Comparing Knowledge Tracing and Performance Factor Analysis by Using Multiple Model Fitting Procedures. In ITS. 35–44.
[10] Robert Gunning. 1952. The technique of clear writing. (1952).
[11] William J. Hawkins, Neil T. Heffernan, and Ryan Shaun Joazeiro de Baker. 2014.
Learning Bayesian Knowledge Tracing Parameters with a Knowledge Heuristic and Empirical Probabilities. In ITS. 150–155.
[12] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW. 507–517.
[13] Hong Hong and Di Xu. 2015. Research of online review helpfulness based on negative binary regress model: the mediator role of reader participation. In 2015 12th International Conference on Service Systems and Service Management (ICSSSM). 1–5.
[14] Jingxian Jiang, Ulrike Gretzel, and Rob Law. 2010. Do Negative Experiences Always Lead to Dissatisfaction? Testing Attribution Theory in the Context of Online Travel Reviews. In ENTER. 297–308.
[15] J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Technical Report. Naval Technical Training Command Millington TN Research Branch.
[16] Nikolaos Korfiatis, Elena García-Bariocanal, and Salvador Sánchez-Alonso. 2012. Evaluating content quality and helpfulness of online product reviews: The interplay of review helpfulness vs. review content. Electronic Commerce Research and Applications 11, 3 (2012), 205–217.
[17] Yang Liu, Xiangji Huang, Aijun An, and Xiaohui Yu. 2008. Modeling and Predicting the Helpfulness of Online Reviews. In ICDM. 443–452.
[18] Julian John McAuley and Jure Leskovec. 2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In WWW. 897–908.
[19] Susan M. Mudambi and David Schuff. 2010. What Makes a Helpful Online Review? A Study of Customer Reviews on Amazon.com. MIS Quarterly 34, 1 (2010), 185–200.
[20] Philip I. Pavlik, Hao Cen, and Kenneth R. Koedinger. 2009. Performance Factors Analysis - A New Alternative to Knowledge Tracing. In AIED. 531–538.
[21] Radek Pelánek. 2015. Metrics for Evaluation of Student Models. In EDM. 19.
[22] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J. Guibas, and Jascha Sohl-Dickstein. 2015. Deep Knowledge Tracing. In NIPS. 505–513.
[23] RJ Senter and Edgar A Smith. 1967. Automated readability index. Technical Report. Univ. Cincinnati.
[24] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In HLT-NAACL.
[25] Yutao Wang and Neil T. Heffernan. 2013. Extending Knowledge Tracing to Allow Partial Credit: Using Continuous versus Binary Nodes. In AIED. 181–188.
[26] Jianan Wu. 2017. Review popularity and review helpfulness: A model for user review effectiveness. Decision Support Systems 97 (2017), 92–103.
[27] Philip Fei Wu, Hans van der Heijden, and Nikolaos Korfiatis. 2011. The Influences of Negativity and Review Quality on the Helpfulness of Online Reviews. In ICIS.