<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can models learned from a dataset reflect acquisition of procedural knowledge? An experiment with automatic measurement of online review quality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martina Megasari</string-name>
<email>firstname.lastname@etu.univ-tours.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Labroche</string-name>
<email>firstname.lastname@univ-tours.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Pandu Wicaksono</institution>
          ,
          <addr-line>Chiao Yun Li, Clément Chaussade, Shibo Cheng</addr-line>
          ,
          <institution>University of Tours</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Patrick Marcel, Verónika Peralta, University of Tours</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Can models learned from a dataset reflect how good humans are at mastering a particular skill? This paper studies this question in the context of online review writing, where the skill corresponds to the procedural knowledge needed to write helpful reviews. To this end, we model the quality of a review by a combination of various metrics stemming from text analysis (like readability, polarity, spelling errors or length) and we use customer-declared helpfulness as a ground truth for constructing the model. We use Knowledge Tracing, a popular model of skill acquisition, to measure the evolution of the ability to write reviews of good quality over a period of time. While recent studies have tried to measure the quality of a review and correlate it to helpfulness, to the best of our knowledge, our work is the first to address this question as the exercise of a reviewer's skill over a sequence of reviews. Our experiments on a set of 41,681 Amazon book reviews show that it is possible to accurately assess the individual skill acquisition of writing a helpful review, based on a statistical model of the procedural knowledge at hand rather than human evaluations prone to subjectivity and variations over time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>In today’s era of big and open data, plenty of datasets are
analyzed to derive models mimicking humans by using machine
learning techniques. The representation and assessment of user
knowledge opens new possibilities for big data analytics, such as
differentiating between novice and expert users, taking advantage of
user experience for recommending (e.g. products or actions),
calculating advanced scores (e.g. credibility), assessing the quality
of users’ analysis, etc. In this paper we focus on the assessment
of procedural knowledge from large data collections.</p>
      <p>
Procedural knowledge is the knowledge of how to do
something. Different from declarative knowledge, which can often be
verbalized, the application of procedural knowledge may not be easily
explained [
        <xref ref-type="bibr" rid="ref3">3</xref>
]. Models exist to evaluate procedural knowledge
acquisition, for instance the popular Bayesian Knowledge
Tracing [
Tracing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Many open datasets illustrate the application of procedural
knowledge. For instance, Amazon review datasets like those
provided by He and McAuley [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] contain customer written
reviews, where the skill of writing helpful reviews is an example
of application of procedural knowledge. However, this skill is
difficult to define and assess. Reviews can be voted helpful or
not by customers, but this assessment is subjective and as such
subject to variations over time, and it is difficult to construct a
model that accurately predicts the helpfulness of a review [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>In this paper, we show that it is possible to benefit from such
very large datasets to learn an individual model of procedural
knowledge acquisition. The resulting model of knowledge has
several nice properties: (1) it is not prone to the usual bias caused
by a single small set of evaluators that might be non
representative or produce a subjective evaluation, (2) it avoids
explicitly defining the procedural knowledge at hand, which is replaced by a
statistical model learned over the large dataset. As a consequence,
the larger the dataset, the more accurate the modeling of the
procedural knowledge, and the better the evaluation of a user's
skill.</p>
      <p>
To illustrate this, we experiment with a use case based on the
aforementioned dataset of Amazon online product reviews. We chose this
use case because it is prototypical of how procedural knowledge
influences decision making. For instance, Mayzlin and Chevalier
studied the effects of online book reviews on Amazon.com and
Barnesandnoble.com and found a positive correlation between the
reviews and the transactions of the book [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. This means that
the reviewers' opinions play an important role in users' decisions
on the transaction. Automatic measurement of the reviewer's skill
may be beneficial to predict how helpful a review is. A skillful
writer is assumed to be able to write a good review, which can
help the customer make a better decision on the transaction.
      </p>
      <p>To motivate our approach, suppose that we want to determine
whether a reviewer is assumed to master the skill of writing
helpful reviews. This is preferable to trying to predict helpfulness
of the reviews, because of the high variability of the reviewer
profiles, reviews and votes received by reviews. However, this
skill corresponds to procedural knowledge and is difficult to
define. Therefore, to evaluate the skill of each reviewer, we use
the classical Knowledge Tracing model. But instead of using the
Knowledge Tracing directly over the votes received by reviews,
we apply it over a model of helpfulness learned from each review.
Our research question is: can this model of helpfulness be used
to assess the skill accurately? Consider the 4 curves displayed
in Figure 1. These curves are related to the evolution over time
of the skill of writing helpful reviews of a particular reviewer
(randomly extracted from the Amazon book review dataset). The
helpfulness curve is the normalized score of helpfulness received
by the 20 reviews written by this reviewer. The model curve is
the helpfulness score as predicted for this reviewer by a model
learned over the entire dataset. The KT helpfulness curve predicts
the probability that this reviewer has acquired the skill of writing
helpful reviews, computed with the helpfulness score. The KT
model curve is the same probability computed with the model.
In this example, even though the skill can
be considered acquired, the helpfulness score is difficult to predict
due to the subjectivity of the voters. On the other hand, a model of
helpfulness can be learned to predict if the skill has been acquired.</p>
      <p>The contributions of this paper are the following: (1) assuming
that writing helpful reviews is a hard-to-define skill, we propose a
model for it. We use low level features of the on-line review such
as rating, spelling error ratio or readability score to build the
model that infers a high level and human-related feature which
is helpfulness. This model is learned over the entire dataset and
can be used to predict the helpfulness of future reviews for one
particular reviewer. (2) Using Knowledge Tracing, we show that
this model can be used to assess skill acquisition without relying
on human entered votes. In particular, we show that this model,
although learned over the entire dataset, is accurate enough to
predict if the skill is acquired by each individual reviewer. To
the best of our knowledge, this work is the first to evaluate a
reviewer’s skill over a sequence of reviews with Knowledge
Tracing.</p>
<p>The remainder of the paper is organized as follows. Section
2 discusses related work. Section 3 defines the features used to
build the model of helpfulness. Section 4 details our approach.
Section 5 explains how the experiment is performed to build the
model and exposes the results. Finally, Section 6 concludes the
paper and discusses some possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORKS AND BACKGROUND</title>
      <p>We first review recent works on online review evaluation and
then describe the Bayesian Knowledge Tracing model and some
of its extensions.
</p>
    </sec>
    <sec id="sec-3">
      <title>Online review evaluation</title>
      <p>
        Readability tests play an important role in online review
evaluation. Various indexes have been proposed to quantify readability
of an English text. Most of these indexes are related to the level of
studies a person needs to understand the text at first reading,
according to American standards. They are computed considering
the number of words, number of sentences, number of syllables
or number of characters as components. The Gunning-Fog Index
(FOG) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] aims to estimate the years of formal education a
person needs to understand the text during the first reading. The
Flesch Reading Ease (FK) [
        <xref ref-type="bibr" rid="ref15">15</xref>
] indicates the difficulty of a text
using the number of words, number of sentences and number
of syllables. Higher values indicate better readability. The
Automated Readability Index (ARI) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] measures the approximate
representation of the US grade level needed to understand the
text. The Coleman-Liau Index (CLI) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is the approximation of
US grade level needed to understand the text. More background
on readability tests can be found in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        Previous works have studied the evaluation of online reviews
due to the popularity of online marketing nowadays. Authors
often pay attention to the influence of online reviews on helpfulness.
Korfiatis et al. investigated the interplay between helpfulness,
rating score and qualitative characteristics of the review text of
37,221 online reviews collected from Amazon UK from March
to April 2008 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The authors theorize that helpfulness
relates to a model with three aspects: conformity (relation between
the review text and the rating), understandability (readability of
the review text) and expressiveness (length of the review text).
The authors formulate several hypotheses and perform linear
regression to validate the relationship between the metrics derived
from reviews and the helpfulness of the reviews. Regarding
understandability, four common readability scores - indicating the
education level the readers need to have in order to understand
the content - are computed: FOG, FK, ARI and CLI. Their results
indicate that the helpfulness of a review is directionally affected by
its qualitative characteristics, and in particular by review text
readability. Precisely, the relationship between reviews of
average length and their readability scores holds for both moderate
and extreme reviews. In addition, readability has more impact
on longer reviews. In their work, metrics related to
polarity, the summary text of reviews and rating deviation (between
the average rating and the reviewer's one) are not considered.
Moreover, due to the purpose of the work, books having
special offers are not considered, to avoid the price effect. In our work,
such books are kept because of the amount of reviews resulting
from this price effect.
      </p>
      <p>
Based on 7,659 book reviews from Amazon UK, Wu et al.
explored whether a negative bias exists when evaluating
helpfulness [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. The assumption was that negative reviews may
be more helpful than positive ones. After applying a regression
model controlling factors such as readability and length of the
reviews, the result shows that the assumption is not yet readily
applicable to online reviews.
      </p>
      <p>
Mudambi and Schuff analyzed 1,587 reviews from Amazon.com
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] to understand how review extremity, review depth and
product type affect the perceived helpfulness of the review. Their
helpfulness model is based on the rating, review text word
count, total votes and product type features. Product type is either
Experience goods or Search goods, where Experience goods are
products that require sampling or purchase in order to evaluate
product quality. Books are examples of experience goods. They
found that for experience goods, moderate reviews are more
helpful than extreme reviews (whether they are strongly positive
or negative). In contrast, it has been observed that reviews closer
to the general opinion of people (average rating score) may be
considered more helpful by the potential buyers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
McAuley and Leskovec [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] propose a latent-factor model
for recommending products that may be preferred by the users
according to their experience level at the moment. The model
evaluates the evolution of users’ experiences and is based on
the rating that users give to products. Unlike other works on
temporal dynamics, which rely on the hypothesis that two users
rating a product at the same time will provide the same rating,
McAuley and Leskovec’s model takes users’ personal development
into consideration in order to evaluate the expertise degree of the
reviewers. Experiments showed for example that experts’ ratings
are easier to predict and are more similar to each other. While
close to our work in the idea of taking the evolution of the user
into account, this work focuses on ratings and not helpfulness,
and therefore does not consider the linguistic aspect of review
text.
      </p>
      <p>
        Liu et al. considered a complex model learned using non-linear
regression, that combines the reviewer’s expertise (based on the
number of similar reviews written in the past), the writing style
of the review (characterized with part of speech tagging and
counting the number of words in each tag), and the timeliness
of the review [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. They showed that the three factors accurately predict
helpfulness, over a dataset of 22,819 reviews collected
from IMDB.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], review helpfulness is considered through five features
including user profile aspects (age, verified purchase) together
with rating, text length and the rank of the review in the webpage.
A model learned on 12,756 reviews was shown to be reasonably
robust.
      </p>
      <p>
        Agnihotri and Bhattacharya explored how the helpfulness of
online reviews is affected by content readability (FK Index),
sentiment analysis and the number of reviews written by a reviewer
[
        <xref ref-type="bibr" rid="ref1">1</xref>
]. It was observed on 1,608 Amazon reviews that the content
readability and text sentiment of the reviews follow a curvilinear
relationship with review helpfulness: reviews whose
readability score is very high or whose sentiment is very positive are
perceived as less helpful.
      </p>
      <p>
        Hong and Xu analyze the impact of review message and
reviewer profile on the helpfulness of 2997 online reviews collected
from Douban.com [
        <xref ref-type="bibr" rid="ref13">13</xref>
]. Using negative binomial regression, the
authors show that reader participation is positively related to
online review helpfulness; reader participation fully mediates the
effect of reviewer expertise history on online review helpfulness
and partially mediates the effects of three other metrics: average
rating, title depth and reviewer network centrality.
      </p>
      <p>To the best of our knowledge, no work ever focused on the
evolution of the quality of review text under the angle of skill
acquisition, with a model learned only on the review content.
</p>
    </sec>
    <sec id="sec-4">
      <title>Knowledge Tracing Models</title>
      <p>
        The Bayesian Knowledge Tracing model was proposed by
Corbett and Anderson, using a Bayesian network to assess people’s
procedural knowledge acquisition or, simply put, “skill level” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
An individual’s grasp of the procedural knowledge is represented
as a binary variable expressing whether the corresponding skill
has been mastered or not. The knowledge of an individual
cannot be directly observed, but it can be inferred by observing the
individual’s answers to a series of questions (or opportunities
to exercise the skill), in order to estimate the probability distribution
of knowledge mastery. Observation variables are also binary: the
answer to a question is either correct or wrong.
      </p>
<p>Specifically, the Knowledge Tracing model has four
parameters, namely, two learning parameters, P(L0) and P(T), and two
performance parameters, P(G) and P(S). P(L0) is the probability
that the skill has been mastered before answering the questions.
P(T) is the knowledge transformation probability: the probability
that the skill will be learned at each opportunity to use the skill
(i.e., the transition from not mastered to mastered). P(G) is the
probability of guess: when the knowledge is not mastered, the
probability that the individual can still answer correctly. P(S) is
the probability of slip, i.e., to fail while the skill is already
mastered. The model uses these parameters to calculate the learning
probability after each question, to monitor the individual’s knowledge
status and predict their future probability of knowledge
acquisition using a Bayesian network.</p>
<p>The probability that a skill L at question i + 1 is mastered,
denoted P(Li+1), is the sum of two probabilities: (1) the posterior
probability that the skill was already learned, contingent on the
evidence at time i, i.e., the ith opportunity to evaluate the skill,
which can either be Correct or Incorrect, and (2) the probability that
the knowledge changes from not mastered to mastered at the ith
opportunity. This is expressed by the following formula:</p>
      <p>P(Li+1) = P(Li | Evidencei) + (1 − P(Li | Evidencei)) × P(T)   (1)</p>
      <p>where:</p>
      <p>P(Li | Evidencei = Correct) = P(Li) × P(¬S) / [P(Li) × P(¬S) + P(¬Li) × P(G)]</p>
      <p>P(Li | Evidencei = Incorrect) = P(Li) × P(S) / [P(Li) × P(S) + P(¬Li) × P(¬G)]</p>
      <p>
        Due to its predictive accuracy, Corbett and Anderson’s Bayesian
Knowledge Tracing is one of the most popular models.
However, several challenges, including local minima, degenerate
parameters and computational costs during fitting, still exist.
Hawkins et al. proposed a fitting method avoiding these
problems while achieving a similar predictive accuracy, and evaluated
it against one of the most popular fitting methods:
Expectation-Maximization [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In this extension, the parameters are fitted
by estimating the most likely opportunity at which each
individual learned the skill. The learner’s performance is thus annotated
with an estimate of when the skill is learned, assuming that a
known state can never be followed by an unknown state. This
annotation is used to construct knowledge sequences that, when
compared with the actual performance sequence, allow to
empirically derive the model's four parameters.
      </p>
      <p>[Tables 1 and 2: the 16 features (rating, polarityReviewText, polaritySummary, deviation, reviewTextSER, summarySER, reviewTextFOG, summaryFOG, reviewTextFK, summaryFK, reviewTextARI, summaryARI, reviewTextCLI, summaryCLI, reviewTextLength, summaryLength), whether each applies to the review text and/or the summary, and their minimum and mean values.]</p>
      <p>
As aforementioned, the performance of an
individual is traditionally represented as a binary value, correct or wrong, which does
not account for all skill learning situations. Wang
et al. proposed to extend the Knowledge Tracing model by
replacing the discrete binary performance node with a continuous
partial credit node [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. In this extension, it is assumed that P (G )
and P (S ) follow two Gaussian distributions, that are described
respectively by their means and standard deviations. Prediction
of the performance node also follows a Gaussian distribution,
in which the mean value is used for the prediction. Notably,
the standard deviation conveys how confident
the prediction is. Experiments with this extension show that by
relaxing the assumption of binary correctness, the predictions of
an individual’s performance can be improved.
      </p>
      <p>
        These two improvements of the Knowledge Tracing model (in
the fitting method and the use of partial credits) were used
successfully in sequencing educational content to students [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We
conclude this section by noting that other models exist for
predicting a learner’s skill. Specifically, Performance Factor Analysis
[
        <xref ref-type="bibr" rid="ref20">20</xref>
] uses standard logistic regression with the student
performance as the dependent variable. Interestingly, it is shown in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] that
Knowledge Tracing can achieve predictive accuracy comparable
to Performance Factor Analysis. Finally, Deep Knowledge Tracing
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] uses Recurrent Neural Networks to model student learning,
with the advantage of not having to set explicit probabilities for
slip and guess. However, these models need very large datasets
to learn the latent state from sequences and, most importantly,
the encoding of the input vectors depends on an upper bound on
the number of exercises, which does not directly fit our context.
      </p>
    </sec>
    <sec id="sec-5">
      <title>FEATURES AND METRICS</title>
      <p>
Consistent with the previous work of Korfiatis et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our
model of helpfulness is based on features that are grouped in
three categories: Conformity, Understandability and
Extensiveness, with additional features compared to [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. We derive
metrics, i.e., numerical attributes to be used in the definition of our
model, from these features. Conformity expresses the internal consistency
of a review. In addition to the classical rating, we
add two metrics in this category: Polarity and Deviation.
Understandability measures the quality of the written text
in terms of readability. We derive five metrics to measure it:
Spelling Error Ratio and 4 readability metrics (FOG, FK,
ARI, and CLI). Finally, Extensiveness refers to the length of the
review. In total, 16 metrics are defined, since the length and
readability metrics apply both to the review text and the summary. We
detail them below; a summary of the features used in the
experiments, with their name, category, and theoretical and empirical range,
is provided in Tables 1 and 2.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conformity</title>
<p>Metrics in this category relate to the consistency of the review. As
the content of a review consists of a rating and a written text, we
can derive a relation between them. A rating should correspond
to the written review and vice versa; hence, a difference between
these two contents might indicate that the review is inconsistent.
For example, a review with a 5-star rating and a very negative
text is inconsistent. Needless to say, inconsistent reviews
may lead to a lower helpfulness score due to the confusion they
bring. From this perspective, we consider Polarity of the text,
which indicates the positiveness or negativeness of a review as a
metric. Besides, the extremity of the rating given by the reviewer
may indicate that the reviewer is biased and has a subjective
point of view on the product being reviewed. Extremely high
and low ratings are associated with lower levels of helpfulness than
reviews with moderate ratings [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. In contrast, reviews closer
to the general opinion of people (average rating score) may be
considered more helpful by the potential buyers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. From this
perspective, we derive the Deviation score, quantifying how
different the rating given by the reviewer is from the average
rating.
      </p>
<p>Rating. The Rating of a review is the user-input quantitative
indicator of the quality of the item reviewed (e.g., ratings range from
1 to 5 for Amazon Book Reviews).</p>
<p>Polarity. The Polarity of a text is measured using a word list that
indicates the positivity, negativity and objectivity of each synset.
The polarity score of a word, given its part of speech, is calculated as
its positivity score minus its negativity score.
The polarity value ranges between -1 and 1: -1 indicates
that the written text is very negative and 1 indicates that it
is very positive.</p>
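<p>As a toy illustration, polarity can be computed as the mean difference between positivity and negativity scores over the words of the text; the lexicon entries below are hypothetical and only stand in for the actual word list.</p>

```python
# Toy polarity scoring; the (positivity, negativity) scores below are
# made up for illustration and do not come from the actual word list.
LEXICON = {
    "good": (0.75, 0.0),
    "great": (0.88, 0.0),
    "bad": (0.0, 0.62),
    "boring": (0.0, 0.5),
}

def polarity(text):
    """Mean of (positivity - negativity) over words found in the lexicon."""
    scores = [p - n for p, n in
              (LEXICON[w] for w in text.lower().split() if w in LEXICON)]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)
```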
      <p>
Deviation. Deviation is calculated as the absolute difference
between the rating of a review and the average rating of the item
reviewed.</p>
      <p>
        Understandability. Metrics in this category relate to the effort needed to understand
the text of the review. This is measured based on the number of
spelling errors in the written text, which is expected to be
negatively correlated with helpfulness [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and with various readability
measures.
      </p>
      <p>Spelling Error Ratio (SER). Spelling Error Ratio is the number
of spelling errors divided by the text length.</p>
      <p>
        Gunning-Fog Index (FOG). The FOG [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] aims to estimate the
years of formal education (according to the American system)
a person needs to understand the text during the first reading.
This index uses the number of words, the number of sentences
and the number of complex words. A word
is considered complex if it has more than
two syllables.
      </p>
<p>FOG = 0.4 × [(nbWords / nbSentences) + 100 × (nbComplexWords / nbWords)]   (2)</p>
      <p>
        Flesch Reading Ease (FK). The FK index [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] indicates the
difficulty of a text using the number of words, number of sentences
and number of syllables.
      </p>
<p>FK = 206.835 − 1.015 × (nbWords / nbSentences) − 84.6 × (nbSyllables / nbWords)   (3)</p>
      <p>
        Automated Readability Index (ARI). The ARI [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] approximates
the US grade level needed to understand the text. This index uses
number of characters, number of words and number of sentences.
      </p>
<p>ARI = 4.71 × (nbCharacters / nbWords) + 0.5 × (nbWords / nbSentences) − 21.43   (4)</p>
      <p>
        Coleman-Liau Index (CLI). The CLI [
        <xref ref-type="bibr" rid="ref5">5</xref>
], like ARI, is an
approximation of the US grade level needed to understand the text.
This index also uses number of characters, number of words and
number of sentences as components.
      </p>
<p>CLI = 5.89 × (nbCharacters / nbWords) − 0.3 × (nbSentences / nbWords) − 15.8   (5)</p>
    </sec>
    <sec id="sec-7">
      <title>Extensiveness</title>
      <p>The textual part of the review consists of a text and a summary
of this text. For both we measure the length in characters,
respectively called Review Text Length and Summary Length.
</p>
    </sec>
    <sec id="sec-8">
      <title>METHODOLOGY</title>
      <p>Our approach is divided into three phases: metric extraction,
model construction and skill evaluation. These phases are detailed
below.
</p>
    </sec>
    <sec id="sec-9">
      <title>Metric extraction and feature selection</title>
      <p>In the first phase, we calculate for each review the scores for the
metrics presented in Section 3, that we use to build the model of
helpfulness. Then we apply feature selection to reduce the set
of metrics by removing redundant ones, while avoiding losing
too much information on the data set. We use a heuristic greedy
method, calculating all the pairwise correlations between
metrics: among metrics that are highly correlated with each other, only the one most
correlated with the helpfulness score is kept, the
others being discarded. Finally, we normalize the scores in order
to be independent of attribute ranges and units and to highlight
the actual importance of each attribute. We use the Min-Max Scaling
normalization strategy.</p>
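<p>A minimal sketch of these two steps (the correlation measure the greedy selection relies on, and the Min-Max normalization) could look like the following; it is illustrative, not the exact implementation.</p>

```python
# Pearson correlation between two metric columns, used to detect
# redundant metrics and to rank them against the helpfulness score.
def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Min-Max Scaling: rescale a metric column to [0, 1].
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```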
<p>In the second phase, we model the helpfulness y of a review as a linear
combination of the m selected metrics x1, . . . , xm:</p>
      <p>y = Σ i=1..m (ωi × xi) + b   (7)</p>
      <p>where ωi ∈ R is the weight reflecting the contribution of feature
i to the overall decision and b ∈ R stands for the bias.</p>
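<p>Equation (7) amounts to a weighted sum of the normalized metrics; as a short sketch, with made-up weights and feature values:</p>

```python
def predict_helpfulness(weights, features, bias):
    """Linear helpfulness model: y = sum over i of w_i * x_i, plus b."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# Made-up weights, normalized feature values and bias, for illustration.
score = predict_helpfulness([0.5, -0.2, 0.1], [0.8, 0.3, 1.0], 0.05)
```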
<p>We restrict our study to linear models
mainly for two reasons. First, these models are simpler and
can be computed more efficiently. Second, they allow for a direct
interpretation of the contribution of each feature to the final
helpfulness decision. To this end, we try a variety of methods
and keep the one best fitting the dataset.</p>
<p>In our tests, error measurement is done using the classical
correlation coefficient, Efron's R2, MAE and RMSE scores.</p>
      <p>In the last phase, we apply Knowledge Tracing (KT) to sequences
of reviews in order to estimate reviewers’ skills. We proceed as
follows: We group the reviews by reviewers, obtaining one
sequence of reviews per reviewer. Each review is considered as
an opportunity to learn the skill (i.e. being able to write useful
reviews) and is graded with a score, representing the reviewer’s
performance (i.e. how useful is the review). We compute two
KT scores: (i) directly from helpfulness ratings, and (ii) from the
learned helpfulness model. In the former, the reviewer’s
performance is calculated as the helpfulness score of the review. In the
latter, it is predicted by the helpfulness model. In both cases, the
ifnal score, output by KT model, expresses the probability that
the skill is mastered by the reviewer.</p>
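      <p>The sequence-building step can be sketched as follows: reviews are grouped per reviewer and ordered chronologically, so that each review becomes one learning opportunity. Field names are illustrative assumptions, not the paper's schema.</p>

```python
# Group reviews by reviewer and sort each reviewer's reviews by time,
# yielding one score sequence per reviewer for Knowledge Tracing.

from collections import defaultdict

def build_sequences(reviews):
    """reviews: iterable of dicts with 'reviewer', 'time', 'helpfulness'."""
    by_reviewer = defaultdict(list)
    for r in reviews:
        by_reviewer[r["reviewer"]].append(r)
    return {
        rid: [r["helpfulness"] for r in sorted(rs, key=lambda r: r["time"])]
        for rid, rs in by_reviewer.items()
    }
```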
      <p>
        We use the continuous version of KT described in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] since
the scores we consider are continuous. In this extension of
KT, P (G ) and P (S ) are assumed to follow Gaussian
distributions and, as such, are represented by a mean value and a
standard deviation. As a consequence, and as opposed to binary
KT, the prediction P (Ln ) also follows a Gaussian distribution,
whose mean expresses the value of the prediction and whose
standard deviation expresses the confidence attached to this
prediction. To learn the 6 parameters of continuous KT, we extend
the approach proposed by Hawkins et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] so that it outputs
estimates of P (G ) and P (S ) described by a mean and a standard
deviation. Then, based on these 6 parameters, the estimation of
each skill acquisition P (Ln ) is performed by running 100 tests
with randomly generated values for P (G ) and P (S ) following
their respective distributions. From these 100 P (Ln ) estimates, we
compute a mean and a standard deviation under the normal
hypothesis.
      </p>
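      <p>A simplified Monte Carlo sketch of this estimation step follows: P (G ) and P (S ) are drawn from Gaussians, a standard BKT update is run per draw, and the 100 final P (Ln ) values yield a mean and standard deviation. Treating the continuous score as the probability of a "correct" outcome is our simplifying assumption here; the cited continuous KT extension differs in its details.</p>

```python
# Monte Carlo estimation of P(Ln): sample guess/slip from Gaussians,
# run a standard BKT update over the score sequence, and summarize the
# resulting P(Ln) samples by mean and standard deviation.

import random
import statistics

def bkt_run(scores, L0, T, G, S):
    L = L0
    for s in scores:
        # Posterior given a fully correct / fully wrong observation,
        # blended by the continuous score s in [0, 1] (our assumption).
        post_c = L * (1 - S) / (L * (1 - S) + (1 - L) * G)
        post_w = L * S / (L * S + (1 - L) * (1 - G))
        post = s * post_c + (1 - s) * post_w
        L = post + (1 - post) * T  # learning transition
    return L

def estimate_Ln(scores, L0, T, g_mean, g_std, s_mean, s_std, runs=100, seed=0):
    rng = random.Random(seed)
    clamp = lambda p: min(max(p, 1e-3), 0.5)  # keep guess/slip in a sane range
    samples = [
        bkt_run(scores, L0, T,
                clamp(rng.gauss(g_mean, g_std)),
                clamp(rng.gauss(s_mean, s_std)))
        for _ in range(runs)
    ]
    return statistics.mean(samples), statistics.stdev(samples)
```

      <p>A steadily high score sequence drives the estimated mastery probability up, while the spread of the 100 runs expresses the confidence of the prediction.</p>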
      <p>However, the efficiency of KT is known to depend on the
granularity of the skills fed to the model: generally, the more
focused the skills, the better the prediction of skill acquisition. In
this respect, each of the features that
feed our linear predictive model of helpfulness can be considered
as a sub-skill related to helpfulness. For this reason, we define
two distinct tests to evaluate the learned model of helpfulness.
In the first, we simply use the output of the linear regression
model as the predicted helpfulness for a review. In the second,
we consider each feature metric as a possible sub-skill evaluation
of the reviewer and learn as many KT models as there are
features. In the end, we have the probability that the sub-skill
corresponding to each feature is acquired. These sub-skill scores
are then aggregated into one single skill acquisition probability.</p>
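      <p>The aggregation of sub-skill probabilities can be sketched as a weighted average, using the (normalized) linear-model weights; the values and function name below are illustrative assumptions.</p>

```python
# Combine one KT probability per feature into a single skill-acquisition
# probability, weighting each sub-skill by its (normalized) model weight.

def aggregate_subskills(subskill_probs, weights):
    """subskill_probs, weights: dicts keyed by feature name."""
    total = sum(abs(w) for w in weights.values())
    return sum(subskill_probs[f] * abs(weights[f]) / total for f in weights)
```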
      <p>
        The global validation of our proposal is given by measuring
the error between the KT based on real ratings, the KT based on
the general linear model and the KT based on aggregated
feature-based models. This error is evaluated by RMSE, which has been
shown to be the strongest performance indicator for binary KT,
with significantly higher correlation than Log Likelihood and
Area Under Curve [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>5 EXPERIMENTS</title>
      <p>
        Our implementation is done in Java 8, with Weka 3.8 for model
learning. We used our own implementation of knowledge
tracing, whose code has been made available on GitHub1
as one contribution of this paper. For polarity extraction, we use
SentiWordNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which lists the positivity, negativity and
objectivity of each synset (set of synonyms). SentiWordNet provides
the score of each word together with its part of speech, hence we perform POS
tagging for each word using the Stanford POS tagging library [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>5.1 Dataset description</title>
      <p>
        The dataset we use for experiments is the Amazon Book Review Data
provided by Julian McAuley from UCSD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We select the book
category in this dataset, resulting in 22,507,155 reviews in total.
      </p>
      <p>As one of our goals is to measure the evolution of the ability to
write reviews of good quality, we need to obtain for each reviewer
a sequence of reviews long enough to observe that evolution.
Therefore, we consider reviewers with fewer than 30 reviews as not
active enough and filter them out. In addition, we only
consider the reviews that have been scored by customers by
means of votes (helpful review or not).</p>
      <p>To confirm the hypothesis that few reviewers have written
many reviews and that many reviewers have written few reviews,
we plotted on Figure 2 the number of reviewers (on a
logarithmic scale) by number of reviews, for reviewers with more than
30 reviews. Each point (x, y) in this figure indicates that y
reviewers have written x reviews. Furthermore, we found
reviewers writing so many reviews that their activity is dubious and
their reviews possibly biased. For instance, the reviewer with ID A14OJS0VWMOSWO
wrote 43,201 reviews with an average score of 4.9991 out of 5.
This reviewer received 240,262 votes, of which 199,573 are helpful.
In our opinion, such reviewers introduce a bias in the dataset.
Hence we limited our experiment to reviewers that
have 30 to 50 reviews.</p>
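      <p>The filtering described above can be sketched as follows: keep only reviews that received votes, then keep reviewers with 30 to 50 such reviews. Field names are assumed for illustration.</p>

```python
# Keep voted reviews from reviewers with between `lo` and `hi` such reviews.

from collections import Counter

def filter_reviewers(reviews, lo=30, hi=50):
    voted = [r for r in reviews if r.get("total_votes", 0) > 0]
    counts = Counter(r["reviewer"] for r in voted)
    active = {rid for rid, c in counts.items() if lo <= c <= hi}
    return [r for r in voted if r["reviewer"] in active]
```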
      <p>
        We calculate the score of each feature from the dataset and
compute their standard deviations, reported in the last column
of Table 2. The standard deviation of the helpfulness score, which varies
in [0, 1], is 0.32, indicating that the scores are quite spread out
and that the dataset has a wide enough variety of helpful
and unhelpful reviews. Moreover, the standard deviations of
the features indicate that creating a model from this dataset is
difficult.
1https://github.com/Cubiccl/Continuous-Knowledge-Tracing/releases/tag/1.0
      </p>
      <p>
        5.2
We now describe how the model of helpfulness is learned from
the dataset. Consistently with [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], our model of helpfulness is
constructed as a linear combination of the metrics extracted from
the review text and summary. More precisely, as explained in
Section 4, we use a linear classifier to learn a weight for each
of the features introduced in the previous section, in order to
understand its contribution to the helpfulness score. We tested
three different approaches to learn the feature weights: Linear
Regression, Perceptron and Support Vector Machine with a linear
kernel. We used off-the-shelf Weka algorithms with 10-fold
cross-validation. Table 5 summarizes the results of those tests, for
various dataset sizes selected according to the minimum number
of votes for the reviews (from 918 reviews with at least 200 votes
up to 522,804 reviews with at least 1 vote). Results
for Perceptron and SVM are not reported for the largest dataset
due to excessive computation time. The results show that linear
regression achieves a good compromise between accuracy and
computation time, with better accuracy on smaller datasets while
handling larger datasets with no significant drop in accuracy. We
therefore chose to work with linear regression in what follows.
      </p>
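      <p>The paper fits the weights with Weka's linear regression; the closed-form single-feature least-squares sketch below only illustrates the principle, not the actual multi-feature implementation.</p>

```python
# Ordinary least squares for one feature: w minimizes the squared error,
# b follows from the means (the multivariate case generalizes this).

def fit_simple(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    return w, b
```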
      <p>5.2.1 Preprocessing. We recall that our definition of
helpfulness is the number of helpful votes divided by the total number
of votes; hence, a review with a large number of votes is a
genuine representation of helpfulness from a customer's point of
view. But a review with only one vote, if that vote is helpful, can
still obtain the maximum helpfulness score, which is not desirable.
Filtering the dataset by number of votes thus becomes necessary. In
order to find the appropriate minimum number of votes for each
review, we iterated this parameter from 1 to 25 for the most
important features of our model (i.e., after feature selection), and
checked the results in terms of correlation and expressiveness
(contribution of each metric), reported in Table 3. We decided
to choose 2 datasets among those tested, based on, first,
expressiveness (determined by the non-zero coefficients in the
linear model), and second, correlation coefficient (which indicates
to what extent the model matches the dataset), for more than
10,000 reviews. The best expressiveness and correlation
coefficients were obtained for at least 12 votes and at
least 23 votes, respectively. At this stage, we are not sure about the effect of
these parameters on the knowledge tracing model. Therefore, we
keep two datasets, to see which gives a better result in the
knowledge tracing model. In what follows, the first dataset is called
minVotes = 12 and consists of 41,681 reviews while the second
dataset is called minVotes = 23 and consists of 11,083 reviews.</p>
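      <p>The preprocessing above can be sketched as follows: helpfulness is the ratio of helpful to total votes, and reviews below the minimum vote count are dropped. Field names are assumptions.</p>

```python
# Helpfulness = helpful votes / total votes, keeping only reviews with
# at least `min_votes` votes (e.g. 12 or 23 in the paper's datasets).

def helpfulness_scores(reviews, min_votes=12):
    return [
        r["helpful_votes"] / r["total_votes"]
        for r in reviews
        if r["total_votes"] >= min_votes
    ]
```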
      <p>Using linear regression on the two datasets minVotes = 12 and
minVotes = 23 results in the models described in Tables 6 and 7
respectively. The models constructed are evaluated with the
correlation coefficient, Efron's R2, MAE and RMSE scores, reported in
Table 8.</p>
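      <p>Three of these fit measures can be sketched directly; in particular, Efron's R2 is 1 minus the ratio of squared prediction error to total variance around the mean.</p>

```python
# Evaluation metrics used to assess the helpfulness models.

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    return (sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)) ** 0.5

def efron_r2(y, yhat):
    ybar = sum(y) / len(y)
    ss_err = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_err / ss_tot
```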
      <p>5.2.2 Feature selection impact. We then proceed to feature
selection, as described in Section 4.1. As shown in Table 8, our
models before and after feature selection achieve very similar
accuracy results. If efficiency in learning the model is an issue,
or if the model should remain as simple as possible, one can
then safely decide to use the model learned on only the selected
features. In what follows, we report the results for both sets of
features.</p>
      <p>
        A second lesson learned from our feature selection step is
that, interestingly, for both datasets, the selected features include
features that were not present in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], namely spelling error ratio,
polarity and deviation. With the notable exception of Summary
Spelling Error Ratio, these features' weights remain steady, and
in some cases relatively important, after feature selection. Quite
surprisingly, ReviewTextSER has no impact on helpfulness, while,
as expected, deviation contributes highly negatively to it.
      </p>
      <p>
        5.2.3 Comparison with the state-of-the-art. Regarding model
accuracy, Table 8 shows that the results we obtained are
comparable to, and in some cases slightly better than, those reported
in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] on datasets of similar size (37,221 Amazon UK reviews
were analyzed in that work). In that work, 3 models were
constructed, and their fitness to the dataset was reported in terms of
Efron's R2 scores. Their three models obtained respectively 0.316,
0.354 and 0.451, while ours score 0.3697 for minVotes = 12 and
0.4651 for minVotes = 23 (the higher the better for Efron's
R2). Importantly, their models incorporate the features number
of votes and number of helpful votes, which we have deliberately
not included in ours, since we aim at predicting helpfulness when
no such scores are available.
      </p>
      <p>Finally, the two datasets minVotes = 12 and minVotes = 23
achieve comparable MAE and RMSE, even though minVotes = 23
shows a better correlation coefficient and Efron's R2. This
illustrates the robustness of our model construction approach to
larger but more skewed datasets.</p>
    </sec>
    <sec id="sec-12">
      <title>5.3 Skill evaluation</title>
      <p>In this section, we show that the model obtained can be used
to accurately predict the learning of the skill of writing helpful
reviews.</p>
      <p>After training the Knowledge Tracing (KT) model as explained
in Section 4.3 using 10-fold cross-validation, we obtain the
average of the six parameters and the KT model RMSE scores.
We also learn one KT model per sub-skill and aggregate them to obtain
a single probability, as explained in Section 4.3. To be consistent
with the learning of the linear regression model, this aggregation
is done with the weights learned for that model. The results are
reported in Table 9 and Table 10. Each table shows the average
skill acquisition probability (mean(Ln )) for the actual helpfulness
skill, the helpfulness model and the aggregation of the sub-skills.
We also report the parameters learned for the KT of the model.</p>
      <p>For the sake of readability, we recall that RMSE scores are
generated in three ways:
• RMSE as reported in Table 8 represents the error between
the helpfulness model scores and the actual helpfulness
scores, without KT involved at that point.
• actual-model Knowledge RMSE (a-mKRMSE) represents
the error between the KT of the actual helpfulness scores
and the KT of the helpfulness as computed with the model.
• actual-Aggregated Knowledge RMSE (a-AggKRMSE)
represents the error between the KT of the actual helpfulness
scores and the aggregation of the KT scores of each feature
taken independently (i.e., each sub-skill).</p>
      <p>Before commenting on the results of the tests, it is important to
note that the average value of the helpfulness skill acquisition
probability (i.e., the value to be predicted) is high. We conjecture
that this is due to the importance of the filtering, in terms of
number of reviews per reviewer and number of votes, applied
to the dataset.</p>
      <p>5.3.1 Accuracy of the two KT models. The key observation
is that switching to KT achieves very good to excellent RMSE
scores, whatever the dataset considered. Notably, predicting the
skill of writing helpful reviews is done much more accurately
than predicting helpfulness. This allows us to answer positively
the question expressed at the beginning of this paper: a model
constructed on a large dataset can be used to assess procedural
knowledge acquisition. Interestingly, predicting each sub-skill
(corresponding to each feature) and combining these predictions
to infer the global skill of writing helpful reviews is significantly
better than predicting the skill at the coarse level of the model.
In our tests, this combination was naively done with the weights
learned by the linear regression algorithm, normalized, bias
included, to build the model of helpfulness. Determining more
sophisticated weight combinations is left as future work.</p>
      <p>5.3.2 Comparison with random sequences of helpfulness scores.
The small RMSE indicates that the KT model is good at predicting
the learning of the writing skill of the reviewers. However, in
order to validate the hypothesis that these good results do not
come from an intrinsic smoothing behavior of the KT model, we
ran the model on random sequences of helpfulness scores. To this
end, we generated as many sequences as the original dataset has
and replaced the helpfulness scores with random numbers
between 0 and 1. The results, reported in Table 11, confirm that for
both datasets the RMSE values are poor. This shows that, for random
sequences of numbers as helpfulness scores, the model fails
to predict the skill of the reviewers (which in this case is expectedly
close to 0.5).</p>
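      <p>The randomized control can be sketched as follows: each reviewer's helpfulness sequence is replaced by uniform random scores in [0, 1], keeping sequence lengths, so that any remaining KT accuracy would reveal mere smoothing.</p>

```python
# Build the random baseline: same sequence structure, uniform random scores.

import random

def randomize_sequences(sequences, seed=42):
    rng = random.Random(seed)
    return {rid: [rng.random() for _ in seq] for rid, seq in sequences.items()}
```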
    </sec>
    <sec id="sec-13">
      <title>6 CONCLUSION</title>
      <p>In this paper, we experimented with a large dataset of Amazon
book reviews to show that a model of review helpfulness can
be used to assess the acquisition of the skill of writing
helpful reviews. Learning such an individual model of procedural
knowledge acquisition has the advantages of being less prone to
human variation and subjectivity (e.g., in judging the helpfulness
of a review) and of not requiring a precise definition of a hard-to-define
skill, which is replaced by a model learned over the dataset. In our
experiments, we modeled the quality of a review by a linear
combination of metrics stemming from text analysis (like readability,
polarity, spelling errors or length) and we used customer-declared
helpfulness as ground truth for constructing the model. This
model achieves comparable to slightly better accuracy results
when compared to a state-of-the-art approach. We used Bayesian
Knowledge Tracing (KT), a popular model of skill acquisition,
to measure the evolution of the ability to write reviews of good
quality over a period of time. Our tests validated our hypothesis,
showing that the model of skill acquisition achieves a very good
to near-perfect accuracy score.</p>
      <p>Our short-term future work includes the revision of both the
helpfulness model and the skill acquisition model. In particular,
the helpfulness model can be extended with advanced features
like sentiment analysis or reviewer profile features, while Deep
Knowledge Tracing could be used instead of classical Knowledge
Tracing. We also want to better understand the relation between
the linear coefficients learned for the helpfulness model and the
KT parameters of the corresponding sub-skills. Long-term goals
include the generalization of our approach to other datasets and
skills. We are particularly interested in better understanding
in what contexts skill acquisition with model building is more
relevant than only building the model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Arpita</given-names>
            <surname>Agnihotri</surname>
          </string-name>
          and
          <string-name>
            <given-names>Saurabh</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Online Review Helpfulness: Role of Qualitative Factors</article-title>
          .
          <source>Psychology &amp; Marketing</source>
          <volume>33</volume>
          ,
          <issue>11</issue>
          (Dec
          <year>2016</year>
          ),
          <fpage>1006</fpage>
          -
          <lpage>1017</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Baccianella</surname>
          </string-name>
          , Andrea Esuli, and
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining</article-title>
          .
          <source>In LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Kathleen M.</given-names>
            <surname>Cauley</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Studying Knowledge Acquisition: Distinctions among Procedural, Conceptual and Logical Knowledge</article-title>
          .
          <source>In 67th Annual Meeting of the American</source>
          Educational Research Association.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Judith A.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dina</given-names>
            <surname>Mayzlin</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The effect of word of mouth on sales: Online book reviews</article-title>
          .
          <source>Journal of marketing research 43</source>
          ,
          <issue>3</issue>
          (
          <year>2006</year>
          ),
          <fpage>345</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Meri</given-names>
            <surname>Coleman</surname>
          </string-name>
          and Ta Lin Liau.
          <year>1975</year>
          .
          <article-title>A computer readability formula designed for machine scoring</article-title>
          .
          <source>Journal of Applied Psychology</source>
          <volume>60</volume>
          ,
          <issue>2</issue>
          (
          <year>1975</year>
          ),
          <fpage>283</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Albert T.</given-names>
            <surname>Corbett</surname>
          </string-name>
          and
          <string-name>
            <given-names>John R.</given-names>
            <surname>Anderson</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Knowledge tracing: Modeling the acquisition of procedural knowledge</article-title>
          .
          <source>User Modeling and User-Adapted Interaction 4</source>
          ,
          <issue>4</issue>
          (
          <year>1994</year>
          ),
          <fpage>253</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Yossi</given-names>
            <surname>Ben David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Avi</given-names>
            <surname>Segal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ya'akov (Kobi)</given-names>
            <surname>Gal</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Sequencing educational content in classrooms using Bayesian knowledge tracing</article-title>
          .
          <source>In LAK</source>
          .
          <volume>354</volume>
          -
          <fpage>363</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Anindya</given-names>
            <surname>Ghose</surname>
          </string-name>
          and
          <string-name>
            <given-names>Panagiotis</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>The EconoMining project at NYU: Studying the economic value of user-generated content on the internet</article-title>
          .
          <source>Journal of Revenue and Pricing Management</source>
          <volume>8</volume>
          ,
          <fpage>2</fpage>
          -
          <lpage>3</lpage>
          (
          <year>2009</year>
          ),
          <fpage>241</fpage>
          -
          <lpage>246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yue</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph E.</given-names>
            <surname>Beck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Neil T.</given-names>
            <surname>Heffernan</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Comparing Knowledge Tracing and Performance Factor Analysis by Using Multiple Model Fitting Procedures</article-title>
          .
          <source>In ITS</source>
          .
          <volume>35</volume>
          -
          <fpage>44</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Gunning</surname>
          </string-name>
          .
          <year>1952</year>
          .
          <article-title>The technique of clear writing</article-title>
          . (
          <year>1952</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>William J.</given-names>
            <surname>Hawkins</surname>
          </string-name>
          , Neil T. Heffernan, and Ryan Shaun Joazeiro de Baker.
          <year>2014</year>
          .
          <article-title>Learning Bayesian Knowledge Tracing Parameters with a Knowledge Heuristic and Empirical Probabilities</article-title>
          .
          <source>In ITS</source>
          .
          <volume>150</volume>
          -
          <fpage>155</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>In WWW</source>
          .
          <volume>507</volume>
          -
          <fpage>517</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hong</surname>
            <given-names>Hong</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Di</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Research of online review helpfulness based on negative binary regress model the mediator role of reader participation</article-title>
          .
          <source>In 2015 12th International Conference on Service Systems and Service Management (ICSSSM)</source>
          .
          <article-title>1-5</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Jingxian</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Ulrike Gretzel, and
          <string-name>
            <given-names>Rob</given-names>
            <surname>Law</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Do Negative Experiences Always Lead to Dissatisfaction? - Testing Attribution Theory in the Context of Online Travel Reviews</article-title>
          . In ENTER.
          <volume>297</volume>
          -
          <fpage>308</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J Peter</given-names>
            <surname>Kincaid</surname>
          </string-name>
          , Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom.
          <year>1975</year>
          .
          <article-title>Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel</article-title>
          .
          <source>Technical Report. Naval Technical Training Command Millington TN Research Branch.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Nikolaos</given-names>
            <surname>Korfiatis</surname>
          </string-name>
          , Elena García-Bariocanal, and
          <string-name>
            <given-names>Salvador</given-names>
            <surname>Sánchez-Alonso</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Evaluating content quality and helpfulness of online product reviews: The interplay of review helpfulness vs. review content</article-title>
          .
          <source>Electronic Commerce Research and Applications</source>
          <volume>11</volume>
          ,
          <issue>3</issue>
          (
          <year>2012</year>
          ),
          <fpage>205</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Yang</given-names>
            <surname>Liu</surname>
          </string-name>
          , Xiangji Huang,
          <string-name>
            <given-names>Aijun</given-names>
            <surname>An</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaohui</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Modeling and Predicting the Helpfulness of Online Reviews</article-title>
          . In ICDM.
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Julian John</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews</article-title>
          .
          <source>In WWW</source>
          .
          <volume>897</volume>
          -
          <fpage>908</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Susan M.</given-names>
            <surname>Mudambi</surname>
          </string-name>
          and
          <string-name>
            <given-names>David</given-names>
            <surname>Schuff</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>What Makes a Helpful Online Review? A Study of Customer Reviews on Amazon.com</article-title>
          .
          <source>MIS Quarterly 34</source>
          ,
          <issue>1</issue>
          (
          <year>2010</year>
          ),
          <fpage>185</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Philip I.</given-names>
            <surname>Pavlik</surname>
          </string-name>
          , Hao Cen, and
          <string-name>
            <given-names>Kenneth R.</given-names>
            <surname>Koedinger</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Performance Factors Analysis - A New Alternative to Knowledge Tracing</article-title>
          .
          <source>In AIED</source>
          .
          <volume>531</volume>
          -
          <fpage>538</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Radek</given-names>
            <surname>Pelánek</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Metrics for Evaluation of Student Models</article-title>
          .
          <source>In EDM. 19.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Piech</surname>
          </string-name>
          , Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami,
          <string-name>
            <given-names>Leonidas J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jascha</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep Knowledge Tracing</article-title>
          .
          <source>In NIPS</source>
          .
          <fpage>505</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Senter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Edgar A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>1967</year>
          .
          <article-title>Automated readability index</article-title>
          .
          <source>Technical Report. University of Cincinnati.</source>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Kristina</surname>
            <given-names>Toutanova</given-names>
          </string-name>
          , Dan Klein,
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yoram</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network</article-title>
          .
          <source>In HLT-NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Yutao</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Neil T.</given-names>
            <surname>Heffernan</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Extending Knowledge Tracing to Allow Partial Credit: Using Continuous versus Binary Nodes</article-title>
          .
          <source>In AIED</source>
          .
          <fpage>181</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Jianan</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Review popularity and review helpfulness: A model for user review effectiveness</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>97</volume>
          (
          <year>2017</year>
          ),
          <fpage>92</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Philip Fei</given-names>
            <surname>Wu</surname>
          </string-name>
          , Hans van der Heijden, and
          <string-name>
            <given-names>Nikolaos</given-names>
            <surname>Korfiatis</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>The Influences of Negativity and Review Quality on the Helpfulness of Online Reviews</article-title>
          .
          <source>In ICIS.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>