<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Extended Recommendation Framework: Generating the Text of a User Review as a Personalized Summary</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mickaël Poussevin</string-name>
          <email>mickael.poussevin@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Guigue</string-name>
          <email>vincent.guigue@lip6.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Gallinari</string-name>
          <email>patrick.gallinari@lip6.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS</institution>
          ,
          <addr-line>4 Place Jussieu, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS</institution>
          ,
          <addr-line>4 Place Jussieu, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS</institution>
          ,
          <addr-line>4 Place Jussieu, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>20</volume>
      <issue>2015</issue>
      <abstract>
        <p>We propose to augment rating-based recommender systems by providing the user with additional information that might help them in their choice or in understanding the recommendation. We consider as a new task the generation of personalized reviews associated to items, and we use an extractive summary formulation for generating these reviews. We also show that the two information sources, ratings and review texts, can be used both for estimating ratings and for generating summaries, leading to improved performance for each system compared to the use of a single source. Besides these two contributions, we show how a personalized polarity classifier can integrate the rating and textual aspects. Overall, the proposed system offers the user three personalized hints for a recommendation: rating, text and polarity. We evaluate these three components on two datasets using appropriate measures for each task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The emergence of the participative web has enabled users
to easily share their sentiments on many different topics. This
opinionated data flow grows rapidly and offers
opportunities for several applications such as e-reputation
management or recommendation. Today, many e-commerce
websites present each item available on their platform with a
description of its characteristics, its average appreciation and
ratings, together with individual user reviews explaining those
ratings.</p>
      <p>
        Our focus here is on user-item recommendation. This is a
multifaceted task where different information sources about
users and items can be considered and different kinds of
recommendation information can be provided to the user. Despite
this diversity, the academic literature on recommender
systems has focused on only a few specific tasks. The most
popular one is certainly the prediction of user preferences
given their past rating profile. These systems typically rely
on collaborative filtering [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to predict missing values in a
user/item/rating matrix.
      </p>
      <p>[Figure 1: System overview. Inputs are user reviews (text and rating). Classic recommender systems compute latent profiles for rating prediction; user and item text profiles additionally feed improved rating predictions and personalized review summaries.]</p>
      <p>
        In this perspective of rating prediction, some authors have made use of additional
information sources available on typical e-commerce sites. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
proposed to extract topics from consumer reviews in order to
improve rating predictions. Recently, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed to learn
a latent space common to both textual reviews and product
ratings; they showed that rating prediction was improved
by such hybrid recommender systems. Concerning the
information provided to the user, some models exploit review
texts for ranking comments that users may like [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or for
answering specific user queries [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        We start here from the perspective of predicting user
preferences and argue that exploiting the information
present in many e-commerce sites allows us to go beyond
simple rating prediction and present users with
complementary information that may help them make their choices.
As an example, we consider the generation of a personalized
review accompanying each item recommendation. Such a
review is a source of complementary evidence for the user's
appreciation of a suggestion. As is done for the
ratings, we exploit past information and user similarity in
order to generate these reviews. Since pure text generation
is a very challenging task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we adopt an extractive
summary perspective: the generated text accompanying each
rating will be extracted from the reviews of selected users
who share similar tastes and appreciations with the target
user. Since ratings and reviews are correlated, this correlation can
also be exploited to improve the predictions. Our rating
predictor will make use of user textual profiles extracted from
their reviews, and summary extraction will in turn use
predicted ratings. Thus both types of information, predicted
ratings and generated text reviews, are offered to the user,
and each prediction, rating and generated text, takes into
account both sources of information. Additional
information could also be provided to the user. We show here,
as an example, that predicted ratings and review texts can
be used to train a robust sentiment classifier which provides
the user with a personalized polarity indication about the
item. The modules of our system are evaluated on the two
main tasks, rating prediction and summary extraction, and
on the secondary task of sentiment prediction. For this,
experiments are conducted on real datasets collected from
amazon.com and ratebeer.com, and the models are compared to
classical baselines.
      </p>
      <p>The recommender system is compared to a classic
collaborative filtering model using the mean squared error metric.
We show that using both ratings and user textual profiles
allows us to improve the performance of a baseline
recommender. The gains stem from a more precise
understanding of the key aspects and opinions included in the
item and user textual profiles. For evaluating the summary text
generated for a couple (user, item), we have at
our disposal a gold standard: the very review text written
by this user on the item. Note that this is a rare situation
in summary evaluation. However, contrary to collaborative
filtering, there is no consensual baseline. We therefore compare
our results to a random model and to oracles optimizing the
ROUGE-n metric; they respectively provide a lower and
an upper bound on the attainable performance. The
sentiment classifier is classically evaluated using classification
accuracy.</p>
      <p>This article is organized as follows. The hybrid
formulation, the review generator and the sentiment classifier are
presented in section 2. Then, section 3 gives an extensive
experimental evaluation of the framework. The overall gains
associated to hybrid models are discussed in section 4. A
review of related work is provided in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MODELS</title>
      <p>
        In this section, after introducing the notations used
throughout the paper, we describe successively the three
modules of our system. We start by considering the prediction
of ratings [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Rating predictors answer the following
question: what rating will this user give to this item? We present
a simple and efficient way to introduce text profiles
representing the writing style and tastes of the user in a hybrid
formulation. We then show how to exploit reviews and
ratings in a new and challenging task: what text will this user write
about this item? We propose an extractive summary
formulation of this task. We then describe how both
ratings and text can be used together in a personalized
sentiment classifier.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Notations</title>
      <p>We use u (respectively i) to refer to everything related to
a user (respectively to an item), and the rating given by user
u to item i is denoted rui. U and I refer to anything
relative to all users and all items, such as the rating matrix
RUI. Similarly, lower case letters are used for scalars or
vectors and upper case letters for matrices. dui is the actual
review text written by user u for item i. It is composed of
ℓui sentences: dui = {suik; 1 ≤ k ≤ ℓui}. In this work,
we consider documents as bags of sentences. To simplify
notations, suik is replaced by sui when there is no
ambiguity. Thus, user appreciations are quadruplets (u, i, rui, dui).
Recommender systems use past information to compute a
rating prediction r̂ui; the corresponding prediction function
is denoted f(u, i).</p>
      <p>For the experiments, ratings and text reviews are split
into training, validation and test sets, respectively denoted
Strain, Sval and Stest, containing mtrain, mval and mtest
user appreciations (text and rating). We denote Strain(u) the
subset of all reviews in Strain that were written by user u, and
mtrain(u) the number of such reviews. Similarly, Strain(i) and
mtrain(i) are used for the reviews on item i.</p>
      <p>2.2 Hybrid recommender system with text profiles</p>
      <p>Recommender systems classically use the rating history to
predict the rating r̂ui that user u will give to item i. The
hybrid system described here makes use of both collaborative
filtering through matrix factorization and textual
information to produce a rating, as described in (1):
f(u, i) = μ + βu + βi + γu·γi + g(u, i)   (1)</p>
      <p>The first three predictors in equation (1) are biases
(overall bias, user bias and item bias). The fourth predictor is
a classical matrix factorization term. The novelty of our
model comes from the fifth term of (1), which takes into account
text profiles to refine the prediction f. Our aim for the
rating prediction is to minimize the following empirical loss
function:</p>
      <p>argmin over μ, βu, βi, γu, γi, g of
L = (1/mtrain) Σ_Strain (rui − f(u, i))²   (2)
To simplify the learning procedure, we first optimize the
parameters of the different components independently, as
described in the following subsections. Then we fine-tune the
combination of these components by learning weighting
coefficients so as to optimize the performance criterion (2) on
the validation set.</p>
      <sec id="sec-3-1">
        <title>2.2.1 Matrix factorization</title>
        <p>We first compute the different biases of eq. (1) as the
averaged ratings over their respective domains (overall, user
and item). For the matrix factorization term, we
approximate the rating matrix RUI using two latent factors: RUI ≈
ΓU ΓIᵀ. Both ΓU and ΓI are matrices representing
collections of latent profiles, with one profile per row. We denote
γu (resp. γi) the row of ΓU (resp. ΓI) corresponding to the
latent profile of user u (resp. item i).</p>
        <p>The profiles are learned by minimizing, on the training set,
the mean squared error between the known ratings in matrix
RUI and the approximation provided by the factorization
ΓU ΓIᵀ. This minimization problem, described in equation
(3) with an additional L2 constraint (4) on the factors, is
solved here using non-negative matrix factorization:
ΓU, ΓI = argmin ‖Mtrain ⊙ (RUI − ΓU ΓIᵀ)‖²F + λU ‖ΓU‖²F + λI ‖ΓI‖²F   (3)</p>
        <p>In this equation, Mtrain is a binary mask with the
same dimensions as matrix RUI, whose entries are 1 only if the
corresponding review is in the training set, ⊙ is the
elementwise product and ‖·‖F denotes the Frobenius norm.</p>
        <p>Let us denote θu the text profile of user u and t(θu′, θu)
a similarity operator between user profiles. The last
component of the predictor f in (1) is a weighted average of user
ratings for item i, where the weight t(θu′, θu) is the similarity
between the text profiles θu′ and θu of users u′ and u, the
latter being the target user. This term takes into account
the fact that two users with similar styles, or using similar
expressions in their appreciation of an item, should give close
ratings to this item. The prediction term for the user/item
couple (u, i) is then expressed as a weighted mean:
g(u, i) = (1/mtrain(i)) Σ_Strain(i) ru′i · t(θu′, θu)   (5)</p>
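        <p>A minimal sketch of the weighted mean of eq. (5), assuming hypothetical data structures (a list of training triplets and a dictionary of text profiles; the similarity t is passed as a callable):</p>
        <preformat>
```python
def g_text(u, i, train_reviews, profiles, t):
    """Text-based prediction term of eq. (5): mean of the ratings given
    to item i by other users, each weighted by the text-profile
    similarity t between that user and the target user u.
    train_reviews: list of (user, item, rating) triplets.
    profiles: dict mapping user -> text profile."""
    others = [(u2, r) for (u2, i2, r) in train_reviews if i2 == i]
    if not others:
        return 0.0
    # normalize by the number of training reviews on item i, as in eq. (5)
    total = sum(t(profiles[u2], profiles[u]) * r for (u2, r) in others)
    return total / len(others)
```
        </preformat>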
        <p>Two different representations for the text profiles θu of
the users are investigated in this article: one is based on
a latent representation of the texts obtained by a neural
network autoencoder, the other relies on a robust bag-of-words
coding. Each one is associated with a dedicated similarity metric
t.</p>
        <p>This leads to two formulations of g, and thus to two rating
prediction models. We denote the former fA (autoencoder)
and the latter fT (bag of words). Details are provided below.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Bag of words.</title>
        <p>A preprocessing step removes all words appearing in fewer
than 10 documents. Then, the 100 000 most frequent words
are kept. Although the number of features is large, the
representation is sparse and scales well. θu is simply the binary
bag of words of all texts of user u. In this high-dimensional
space, the proximity in style between two users is well
described by a cosine function; a high value indicates similar
word usage:
t(θu′, θu) = θu′·θu / (‖θu′‖ ‖θu‖)   (6)</p>
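        <p>For binary bag-of-words vectors, the cosine of eq. (6) reduces to a set computation, which exploits the sparsity of the representation. A sketch (representing each binary profile sparsely as a set of word ids, a hypothetical but common choice):</p>
        <preformat>
```python
import math

def cosine(bow_a, bow_b):
    """Cosine similarity of eq. (6) between two binary bag-of-words
    profiles represented as sets of word ids: for binary vectors, the
    dot product is the size of the intersection and the norm is the
    square root of the set size."""
    if not bow_a or not bow_b:
        return 0.0
    return len(bow_a & bow_b) / math.sqrt(len(bow_a) * len(bow_b))
```
        </preformat>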
      </sec>
      <sec id="sec-3-3">
        <title>Autoencoder.</title>
        <p>The neural network autoencoder has two components: a
coding operator and a decoding operator, denoted
respectively cod and dec. The two vectorial operators are learned
so as to enable the reconstruction of the original text after
a projection into the latent space. Namely, given a sentence
suik represented as a binary bag-of-words vector, we obtain
a latent profile θsuik = cod(suik) and then reconstruct
an approximation of the sentence as ŝuik = dec(θsuik).</p>
        <p>The autoencoder is optimized so as to minimize the
reconstruction error over the training set:
cod*, dec* = argmin_{cod, dec} Σ_Strain (1/ℓui) Σ_{k=1..ℓui} ‖suik − dec(cod(suik))‖²   (7)</p>
        <p>
          We use the settings proposed in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]: our dictionary is
obtained after stopword removal and selection of the 5000 most
frequent words. We did not use a larger dictionary such as
the one used for the bag-of-words representation, since it does
not lead to improved performance and simply increases the
computational load. All sentences are represented as binary
bags of words over this dictionary. The coding dimension
has been set to 1000 after a few evaluation trials. Note that
the precise value of this latent dimension is not critical, and
performance is similar over a large range of dimension values.
Both cod and dec use sigmoid units sig(t) = 1 / (1 + exp(−t)):
cod(suik) = θuik = sig(W suik + b)
dec(θuik) = sig(Wᵀ θuik + b′)   (8)
        </p>
        <p>Here, θuik is a vector, W is a 5000×1000 weight matrix
and sig(·) is a pointwise sigmoid operator acting on the
vector W suik + b.</p>
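        <p>A sketch of the tied-weight operators of eq. (8) (not the authors' implementation; for convenience, W is stored here as a latent × vocabulary matrix so that cod computes W s + b directly):</p>
        <preformat>
```python
import numpy as np

def sig(t):
    """Pointwise sigmoid unit sig(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def cod(s, W, b):
    """Coding operator of eq. (8): latent profile of a sentence s
    given as a binary bag-of-words vector."""
    return sig(W @ s + b)

def dec(z, W, b2):
    """Decoding operator of eq. (8), using the tied weights W^T to
    reconstruct an approximation of the bag-of-words vector."""
    return sig(W.T @ z + b2)
```
        </preformat>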
        <p>
          As motivated in [
          <xref ref-type="bibr" rid="ref11 ref5">11, 5</xref>
          ], such a latent representation helps
exploit term co-occurrences and thus introduces some
semantics, providing a robust text representation. The hidden
activity of this neural network produces a continuous
representation for each sentence, accounting for the presence or
absence of groups of words.
        </p>
        <p>θu is obtained by coding the vector corresponding to all
the text written by user u in the past. It lies in a latent
word space where a low Euclidean distance between users
means a similar usage of words. Thus, for the similarity t,
we use an inverse Euclidean distance in the latent space:
t(θu′, θu) = 1 / (ε + ‖θu′ − θu‖)   (9)</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.2.3 Global training criterion for rating prediction</title>
        <p>In order to connect all the elementary components
described above with respect to our recommendation task, we
introduce (positive) weighting parameters λ in (1). Thus,
the initial optimization problem (2) becomes:
λ* = argmin_λ (1/mtrain) Σ_Strain (rui − (λ1 μ + λ2 βu + λ3 βi + λ4 γu·γi + λ5 g(u, i)))²   (10)</p>
        <p>The linear combination is optimized using a validation set:
this step guarantees that all components are combined in an
optimal manner.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.3 Text generation model</title>
      <p>The goal here is to generate a review text for each (u, i)
recommendation. During the recommendation process, this
text is additional information for users to consider. It
should catch their interest and, in principle, be close to the
text that user u could have written himself on item i. Each
text is generated as an extractive summary, where the
extracted sentences su′i come from the reviews written by
other users (u′ ≠ u) about item i. Sentence selection is
performed according to a criterion which combines a
similarity between the sentence and the textual user profile and
a similarity between the actual rating ru′i and the
prediction r̂ui made for (u, i), computed as described in section 2.2.
The former measure could take into account several
dimensions, like vocabulary, sentiment expression and even style;
here it is mainly the vocabulary which is exploited. The
latter measures the proximity between user tastes. For the
text measure, we make use of the similarity t introduced
in section 2.2. As before, we consider two
representations for texts (latent coding and raw bag of words). For the
rating similarity, we use r(ru′i, rui) = 1/(1 + |ru′i − rui|).</p>
      <p>Suppose one wants to select a single sentence for the
extracted summary. The sentence selection criterion is then
a simple average of the two similarities:
h(su′i, ru′i, θu′, θu, i) = (t(su′i, θu) + r(ru′i, r̂ui)) / 2   (11)</p>
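      <p>A minimal sketch of the score of eq. (11), assuming the text similarity t(su′i, θu) has already been computed and is passed as a float:</p>
      <preformat>
```python
def h(text_sim, r_other, r_pred):
    """Sentence score of eq. (11): average of the text similarity
    t(s_u'i, theta_u) and the rating similarity
    r(r_u'i, r_hat_ui) = 1 / (1 + |r_u'i - r_hat_ui|)."""
    rating_sim = 1.0 / (1.0 + abs(r_other - r_pred))
    return (text_sim + rating_sim) / 2.0
```
      </preformat>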
      <p>Note that this function may score any piece of text. In
the following, we consider three possibilities for
generating text reviews. The first one simply consists in selecting
the best sentence su′i among all the training sentences for
item i with respect to h; we call it 1S, for single sentence.
The second one selects a whole review du′i among all the
reviews for i; the document is here considered as one long
sentence. This is denoted CT, for complete text. The third
one is a greedy procedure that selects multiple sentences; it
is denoted XS. It is initialized with 1S, and then sentences
are selected under two criteria: relevance with respect to h,
and diversity with respect to the sentences already selected.
Selection is stopped when the length of the text is greater
than the average length of the texts of the target user.
Algorithm 1 sums up the XS procedure for generating the text
d̂ui for the couple (user u, item i).</p>
      <p>Data: u, i, S = {(su′i, ru′i), u′}
Result: d̂ui
su′i ← argmax_{su′i ∈ S} h(su′i, ru′i, θu′, θu, i);
d̂ui ← su′i;
Remove su′i from S;
while length(d̂ui) &lt; averagelength(u) do
  su′i ← argmax_{su′i ∈ S} h(su′i, ru′i, θu′, θu, i) − cos(su′i, d̂ui);
  d̂ui ← d̂ui + su′i;
  Remove su′i from S;
end
Algorithm 1: XS greedy procedure: selection of successive
sentences to maximize both relevance and diversity. d̂ui is
the text that is generated, sentence after sentence.</p>
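      <p>A sketch of the XS greedy selection (hypothetical data structures; base scores h are precomputed and passed with each candidate, length is counted in words, and the diversity penalty against the current summary is passed as a callable standing in for the cosine term of Algorithm 1):</p>
      <preformat>
```python
def extract_summary(candidates, diversity, max_len):
    """Greedy XS procedure (Algorithm 1 sketch): start with the best
    sentence (1S), then repeatedly pick the unused sentence maximizing
    base score minus a diversity penalty against the current summary,
    stopping once the summary reaches max_len words.
    candidates: list of (sentence, base_score) pairs."""
    pool = dict(candidates)
    best = max(pool, key=pool.get)  # 1S initialization
    summary = [best]
    del pool[best]
    while pool and sum(len(s.split()) for s in summary) < max_len:
        joined = " ".join(summary)
        best = max(pool, key=lambda s: pool[s] - diversity(s, joined))
        summary.append(best)
        del pool[best]
    return " ".join(summary)
```
      </preformat>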
    </sec>
    <sec id="sec-5">
      <title>2.4 Sentiment prediction model</title>
      <p>We show here how polarity information about an item can
be estimated by exploiting both the user's predicted ratings
and his textual profile. Exploiting both information sources
improves sentiment prediction performance compared
with a usual text-based sentiment classifier.</p>
      <p>
        Polarity classification is the task of predicting whether a
text dui (here, of a review) is positive or negative. We use the
ratings rui as ground truth and follow a standard
thresholding procedure [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: reviews rated 1 or 2 are considered
negative, while items rated 4 or 5 are positive. All texts
rated 3 are ignored, as it is unclear whether they
are positive or negative: it strongly depends on the rating
habits of the user.
      </p>
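      <p>The thresholding rule above can be sketched directly (labels 0/1 and the None sentinel for the ignored rating are illustrative choices):</p>
      <preformat>
```python
def polarity_label(rating):
    """Standard thresholding of ratings into polarity labels:
    1-2 -> negative (0), 4-5 -> positive (1), 3 -> ambiguous,
    ignored (None)."""
    if rating <= 2:
        return 0
    if rating >= 4:
        return 1
    return None
```
      </preformat>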
      <p>For evaluation purposes, we consider two baselines. The first
one only uses the rating prediction of our recommender
system, f(u, i), as a label prediction; this value is then
thresholded as indicated above. The second one is a classical text
sentiment classifier. Denoting by dui the binary bag-of-words
representation of a document and cui the binary label
associated to the rating rui, it uses a linear SVM s(dui) = dui·w.
Note that this is usually a strong baseline for the polarity
classification task. Our final classifier combines f(u, i)
and s(dui) in a joint optimization problem.</p>
      <p>[Table 1: dataset characteristics — source, subset names, number of users and items.]</p>
    </sec>
    <sec id="sec-5b">
      <title>3. EXPERIMENTS</title>
      <p>All three modules (ratings, text, sentiments) are evaluated
independently, since there is no global evaluation framework.
Together, these individual performances should however provide
a quantitative appreciation of the whole system.</p>
      <p>
        We use two real world datasets of user reviews, collected
from amazon.com [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and ratebeer.com [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Their
characteristics are presented in table 1.
      </p>
      <p>Below, we first present how the datasets are preprocessed
(section 3.1). The benefits of incorporating text in the rating
prediction of the recommender system are then discussed in
section 3.2. The quality of the generated reviews is evaluated
and analyzed in section 3.3. Finally, the performance of the
sentiment classifier combining text and ratings is described
in section 3.4. Two vocabularies are used:
for the autoencoder, we have selected the 5000 most
frequent words after a stopword removal step; the
autoencoder input vector is then a binary vector of
dimension 5000;
for the raw representation, we have selected the 100 000
most frequent words appearing in more than 10
documents (including stopwords) and used a binary vector
representation.</p>
      <p>For the experiments, we consider several subsets of the
databases with different numbers of users and items. Each
dataset is built by extracting, for a given number of users
and items, the most active users and the most commented
items. Dataset characteristics are given in table 1.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Recommender system evaluation</title>
      <p>Let us first consider the evaluation of the rating
prediction. The metric used here is the mean squared error (MSE)
between rating predictions r̂ui and actual ratings rui. The
lower the MSE, the better the model estimates
the correspondence between user tastes and items. Results
are presented in table 2.</p>
      <p>The models are referenced using the notations introduced
in section 2.2. The first column corresponds to a trivial
system which predicts the overall bias μ; the second predicts
the user bias βu. Both give poor performance, as expected.</p>
      <p>The third column corresponds to the item bias βi
baseline. It assumes that user taste is not relevant and that each
item has its own intrinsic quality. The improvement with
respect to μ and βu is important, since the MSE is halved. The
fourth column corresponds to a nonnegative matrix
factorization baseline, denoted γu·γi. It jointly computes latent
representations for user tastes and item characteristics.
Unsurprisingly, it is our best baseline.</p>
      <p>It can be noted that performance tends to degrade when
the subset size increases. This is a side effect of
the review selection process used for building the different
datasets: smaller datasets contain the most active users and
the most commented items, and the estimation of their profiles
benefits from the high number of reviews per user (and item)
in this context.</p>
      <p>The last two columns refer to our hybrid recommender
systems, using the two text representations introduced in
section 2.2. Both fA (autoencoder) and fT (raw text)
perform better than the baseline collaborative filtering system,
and both have similar approximation errors. The main
difference between the systems comes from the complexity of
the approach: during the learning step, fT is much faster
than fA, given that no autoencoder optimization is
required. On top of that, fT remains faster at inference:
the inherent sparsity of the bag-of-words representation
enables fT to provide faster computations than fA. The
autoencoder works in a smaller-dimensional space, but it is not
sparse.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Text generation evaluation</title>
      <p>We now move on to the evaluation of the personalized
review text generation module. Since we are using an
extractive summary procedure, we make use of a classical
measure for summarization systems: the recall-oriented
ROUGE-n metric, comparing the generated text against
the actual text of the review produced by the user. As far as
we know, generating candidate reviews has never been dealt
with in this context, and this is a novel task. The ROUGE-n
metric is the proportion of n-grams of the actual text found
in the predicted (candidate) text; we use n ∈ {1, 2, 3}. The
higher the ROUGE-n, the better the quality of the candidate
text. A good ROUGE-1 means that topics and
vocabulary are correctly captured, while ROUGE-2 and ROUGE-3
are more representative of the user's style.</p>
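      <p>A simplified sketch of this recall-oriented measure (set-based rather than count-based, so repeated n-grams are counted once; standard ROUGE implementations use n-gram counts):</p>
      <preformat>
```python
def rouge_n(reference, candidate, n):
    """Recall-oriented ROUGE-n sketch: proportion of the reference
    text's distinct n-grams that also appear in the candidate text."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    cand = ngrams(candidate.split(), n)
    return len(ref & cand) / len(ref)
```
      </preformat>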
      <p>A first baseline is given by using a random scoring function
h (instead of the formulation given in (11)); it provides a
lower bound on the performance. Three oracles are then
used to provide an upper bound on the performance: they
directly optimize the ROUGE-n metrics from the data on
the test set. A matrix factorization baseline is also used. It is
a special case of our model where no text information is used.
This model computes the same score for all the sentences of
a given user relative to an item. When only one sentence
is selected, it is taken at random among the sentences
of this user for the item. With greedy selection, the first
sentence is chosen at random, and then the cosine diversity
term (algorithm 1) allows a ranking of the next candidate
sentences. Our proposed method is evaluated with the two
different user profile representations θu (autoencoder and
raw text). The performance of these seven models on
the two biggest datasets with respect to the three metrics
is aggregated in figure 2.</p>
      <p>Each histogram corresponds to a text selection entity (whole
review text, best single sentence, greedy sentence selection).
Groups in the histograms (respectively, row blocks of the
tables) are composed of three cells corresponding respectively
to the ROUGE-1, -2 and -3 metrics. Not surprisingly, the results
for the single sentence selection procedure (1S) are always
worse than for the other two (CT: complete review, and XS:
multiple sentences). This is simply because a sentence
contains fewer words than a full review and can hardly share
more n-grams with the reference text than the full text does. For
the ratebeer.com datasets, selecting a set of sentences clearly
offers better performance than selecting a whole review in
all cases. Texts written to describe beers also describe the
tasting experience. Was it in a bar or at home? Was it a
bottle or on tap? Texts of the community share the same
structure and vocabulary to describe both the tasting and
the flavors of the beer. Most users write short and precise
sentences. This is an appropriate context for our sentence
scoring model, where the habits of users are captured by our
recommender systems. The performance slightly decreases
when the size of the dataset is increased. As before, this is
in accordance with the selection procedure of these datasets,
which focuses first on the most active users and most commented
items. For Amazon, the conclusion is not so clear, and
depending on the conditions, either whole reviews or selected
sentences get the best score. This is linked to the higher variety
in the community of users on the website: well-structured
sentences like those present in RateBeer are here mixed
with different levels of English and troll reviews.</p>
      <p>Overall, the different models follow a clear
hierarchy. First, stating the obvious, the random model has the
worst performance. Then, using a recommender system to
select relevant sentences helps in terms of ROUGE-n
performance. Using the text information brings, most of the
time, only a small score improvement. Overall, our models
offer only small improvements here with respect to random
or NMF text selection (i.e., selection based on rating similarity only).
After analyzing this behavior, we believe that it is due to
the shortness of the text reviews, to their relatively
standardized form (arguments are very similar from one review
to another), to the peaked vocabulary distribution of the
reviews, and to the nature of ROUGE. The latter is a classical
recall-oriented summarization evaluation measure, but it does
not distinguish well between text candidates in this
context. This also shows that there is room for improvement
on this aspect.</p>
      <p>Concerning the oracle, several conclusions can be drawn. For both single sentence and complete text selection, the gap between the ROUGE oracles and the proposed selection method is large, suggesting that there is still room for improvement here too. For the greedy sentence selection, the gap between the oracles and the hybrid recommender systems is moderate, suggesting that the procedure is fully efficient here. However this conclusion should be moderated. It can be observed that, whereas the ROUGE oracle is effectively an upper bound for single sentence or whole review selection, this is no longer the case for multiple sentence selection. Because of the complexity of selecting the best subset of sentences according to a loss criterion (which amounts to a combinatorial selection problem), we have been using a suboptimal forward selection procedure: we first select the best ROUGE sentence, then the second best, etc. In this case the ROUGE procedure is no longer optimal.</p>
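      <p>The suboptimal forward selection procedure described above can be sketched as a greedy loop (hypothetical function names; the scorer can be any ROUGE-style recall measure):</p>

```python
from collections import Counter

def flat_rouge1(sents, ref):
    """Unigram recall of a list of candidate sentences against a reference text."""
    toks = [t for s in sents for t in s.split()]
    c, r = Counter(toks), Counter(ref.split())
    return sum(min(n, r[g]) for g, n in c.items()) / max(len(ref.split()), 1)

def greedy_select(sentences, reference, score, k=3):
    """Forward selection: repeatedly add the sentence that most improves the
    score of the growing summary. Greedy, hence suboptimal: the best k-subset
    may differ from the sequence of locally best additions."""
    summary, remaining = [], list(sentences)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: score(summary + [s], reference))
        # Stop early once no sentence improves the summary any further.
        if score(summary + [best], reference) <= score(summary, reference):
            break
        summary.append(best)
        remaining.remove(best)
    return summary

sents = ["great hoppy taste", "served too warm", "lovely amber color"]
ref = "great taste and lovely amber color"
print(greedy_select(sents, ref, flat_rouge1, k=2))
```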
      <p>Concerning the measures, the performance decreases rapidly when we move from ROUGE-1 to ROUGE-2, 3. Given the problem formulation and the context of short product reviews, ROUGE-2, 3 are clearly too constraining and the corresponding scores are not significant.
3.4 Sentiment classification evaluation</p>
      <p>
        The performance of the different models, using the sentiment classification error as an evaluation metric, is presented in table 3. Because they give very poor performance, the bias recommendation models (μ and μu) are not presented here. The item bias μi, second column, gives a baseline, which is improved by the matrix factorization γu·γi, third column. Our hybrid models fA, fourth column, and fT, fifth column, have lower classification errors than all the other recommender systems. The first column, LL, is the linear support vector machine (SVM) baseline. It has been learnt on the training set texts, and the regularization hyperparameter has been selected using the validation set. Our implementation relies on liblinear (LL) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Its performance is better than that of the recommender systems, but it should be noted that it makes use of the actual text dui of the review, whereas the recommender systems only use past information regarding user u and item i. Note that even in this context, the recommender performance on RateBeer is very close to the SVM baseline.</p>
      <p>It is then possible to combine the two models, according to the formulation proposed in section 2.4. The resulting hybrid approaches, denoted LL + fA and LL + fT, exploit both the text-based decision (SVM) and the user profile (fA and fT). This combined model shows good classification performance and outperforms the LL baseline in 4 out of 7 experiments in table 3, while performing similarly to LL in the other 3 experiments.</p>
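      <p>The combination of the two models can be illustrated, in simplified form, as a late fusion of their decision scores (a sketch assuming a simple weighted vote; the exact formulation of section 2.4 may differ):</p>

```python
def combine(svm_score, rec_score, alpha=0.5):
    """Late fusion of two polarity scores in [-1, 1]:
    svm_score is computed from the review text, rec_score only from the
    (user, item) profiles. alpha weights the text-based decision and
    would be tuned on a validation set."""
    return alpha * svm_score + (1 - alpha) * rec_score

def predict_polarity(svm_score, rec_score, alpha=0.5):
    """Final binary sentiment decision of the fused model."""
    return 1 if combine(svm_score, rec_score, alpha) >= 0 else -1

# A confident recommender score can flip a borderline SVM decision:
print(predict_polarity(-0.05, 0.8))  # -> 1
```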
    </sec>
    <sec id="sec-8">
      <title>OVERALL GAINS</title>
      <p>In order to get a global vision of the overall gain provided by the proposed approach, we summarize here the results obtained on the different tasks. For each task, the gain with respect to the (task dependent) baseline is computed and averaged (per task) over all datasets. The metric depends on the task. Results are presented in figure 3.</p>
      <p>[Figure 3: Overall gains per task, averaged over datasets. (a) Recommender systems: gain in % w.r.t. the MSE of γu·γi (baseline: matrix factorization) for μ, μu, μi, fA and fT. (b) Summarizers: gain in % w.r.t. the random selection procedure on ROUGE-1, 2, 3 for γu·γi, fA and fT. (c) Opinion classifiers: gain in % w.r.t. the good-classification rate of LL (baseline: SVM) for μi, γu·γi, fA, fT, LL + fA and LL + fT.]</p>
      <p>For the mean squared error metric (figure 3a) the matrix factorization is used as baseline. The user bias μu heavily fails to generalize on two datasets. The item bias is closer to the baseline (−11.43%). Our hybrid models, which use texts to refine user and item profiles, bring a gain of 5.71% for fA and 5.63% for fT. This demonstrates the interest of including textual information in the recommender system. Autoencoder and raw text approaches offer similar gains, the latter approach being overall faster.</p>
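      <p>The models compared here can be summarized by their prediction functions: the biases μ, μu, μi and the latent profiles γu, γi, with the hybrid variants refining the profiles using text. A hypothetical sketch of the biased matrix factorization predictor and its MSE evaluation (training omitted, random parameters for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3
mu = 3.5                                    # global mean rating
mu_u = rng.normal(0, 0.2, n_users)          # per-user bias offsets
mu_i = rng.normal(0, 0.2, n_items)          # per-item bias offsets
gamma_u = rng.normal(0, 0.1, (n_users, k))  # latent user profiles
gamma_i = rng.normal(0, 0.1, (n_items, k))  # latent item profiles

def predict(u, i):
    """Biased matrix factorization: mu + mu_u + mu_i + gamma_u . gamma_i.
    The hybrid models of the paper refine gamma_u / gamma_i using texts."""
    return mu + mu_u[u] + mu_i[i] + gamma_u[u] @ gamma_i[i]

def mse(pairs, ratings):
    """Mean squared error over (user, item) pairs, the metric of figure 3a."""
    return np.mean([(predict(u, i) - r) ** 2 for (u, i), r in zip(pairs, ratings)])

print(round(float(predict(0, 1)), 3))
```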
      <p>For the text generation, we take the random model as baseline and results are presented in figure 3b. The gain is computed for the three investigated frameworks (CT: review selection, 1S: one sentence selection, XS: multiple sentence selection) and per measure (ROUGE-1, 2, 3), and then averaged to one overall gain. ROUGE-n oracles clearly outperform other models, which seems intuitive. The different recommender systems have very close behaviors, with respective gains of 11.15% (matrix factorization), 11.89% (auto-encoder) and 11.83% (raw text). Here textual information helps but does not clearly dominate ratings, providing only a small improvement. Remember that although performance improvement with respect to baselines is desirable, the main novelty of the approach here is to propose a personalized summary generation together with the usual rating prediction.</p>
      <p>For the opinion classifier, presented in figure 3c, the baseline consists in a linear SVM. Basic recommender systems perform poorly with respect to the baseline (LL). Surprisingly, the item bias μi (−68.71%) performs slightly better than the matrix factorization γu·γi (−69.54%) in the context of sentiment classification (no neutral reviews and binary ratings). Using textual information increases the performance. The autoencoder based model fA (−57.17%) and the raw text approach fT (−58.31%) perform similarly. As discussed in 3.4, the linear SVM uses the text of the current review whereas the recommender systems do not. As a consequence, it is worth combining both predictions in order to exploit text and past profiles: the resulting models give respective gains of 4.72% (autoencoder) and 3.89% (raw text) w.r.t. the SVM.</p>
    </sec>
    <sec id="sec-9">
      <title>RELATED WORK</title>
      <p>Since the paper covers the topics of rating prediction, summarization and sentiment classification, we briefly present each of them.</p>
    </sec>
    <sec id="sec-10">
      <title>Recommender systems</title>
      <p>
        Three main families of recommendation algorithms have been developed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: content-based, knowledge-based, and collaborative filtering. Given the focus of this work on consumer reviews, we considered collaborative filtering. For merchant websites the goal is to encourage users to buy new products and the problem is usually considered either as the prediction of a ranked list of relevant items for each user [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or as the completion of missing ratings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We have focused here on the latter approach for evaluation reasons, since we use data collected from third-party sources.
      </p>
      <p>
        Text summarization for consumer reviews
Early reference work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] on consumer reviews has focused on the global summarization of user reviews for each item. The motivation of this work was to extract the sentiments associated with a list of features from all the item review texts. The summarization took the form of a rating or of an appreciation of each feature. Here, contrary to this line of work, the focus is on personalized item summaries for a target user. Given the difficulty of producing a comprehensive synthetic summary, we have turned this problem into a sentence or text selection process.
      </p>
      <p>
        Evaluating summaries is challenging: how can one assess the quality of a summary when the ground truth is subjective? In our context, the review texts are available and we used them as the ground truth, with the classical ROUGE-n summary evaluation measures [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>Sentiment classification</title>
      <p>
        Different latent text representations have been proposed in this scope: [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed a generative model to jointly represent topics and sentiments, and recently several works have considered matrix factorization or neural networks in an attempt to develop robust sentiment recognition systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] goes further and proposes to learn two types of representation: a vectorial model is learned for word representation together with a latent transformation model, which allows the representation of negations and quantifiers associated with an expression.
      </p>
      <p>
        We have investigated two kinds of representation for the texts: bags of words and a latent representation through the use of autoencoders as in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also uses a latent representation of the reviews, although in a probabilistic setting instead of the deterministic one used here.
      </p>
    </sec>
    <sec id="sec-12">
      <title>Hybrid approaches</title>
      <p>
        In the field of recommendation, a first hybrid model was proposed by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: it is based on the hand labeling of review sentences (topic and polarity) to identify relevant characteristics of the items. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] pushes the exploitation of texts further, using a joint latent representation for ratings and textual content with the objective of improving the rating accuracy. These two works are focused on rating prediction and do not consider delivering additional information to the user. Very recently, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] has considered adding an explanation component to a recommender system. For that, they propose to extract from the review texts some keywords that are supposed to explain why a user likes or dislikes an item. This is probably the work whose spirit is closest to ours, but they do not provide a quantitative evaluation.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] combined opinion mining and text summarization on product reviews with the goal of extracting their qualities and defects. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposed a system for delivering personalized answers to user queries on specific products. They built the user profiles relying on topic modeling, without any sentiment dimension. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a personalized news recommendation algorithm evaluated on the Yahoo portal using user feedback, but it does not investigate rating or summarization issues. Overall, we propose in this article to go beyond a generic summary of item characteristics by generating for each user a personalized summary that is close to what they would have written about the item themselves.
      </p>
      <p>
        For a long time, sentiment classification has ignored the user dimension and has focused, for example, on the conception of "universal" sentiment classifiers able to deal with a large variety of topics [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Considering the user has become an issue only very recently. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], for example, exploited explicit relations in social graphs to improve opinion classifiers, but their work is focused on this aspect only. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed to distinguish different rating behaviors and showed that modeling the review authors on a scale ranging from connoisseur to expert offers a significant gain for an opinion prediction task.
      </p>
      <p>In our work, we have investigated the benefits of incorporating the text of user reviews into recommender systems to improve their performance as sentiment classifiers. We have additionally proposed, as a secondary contribution, an original model mixing recommender systems and linear classification.</p>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSION</title>
      <p>This article proposes an extended framework for the recommendation task. The general goal is to enrich classical recommender systems with several dimensions. As an example, we show how to generate personalized reviews for each recommendation using extracted summaries. This is our main contribution. We also show how ratings and texts can be used to produce efficient personalized sentiment classifiers for each recommendation. Depending on the application, other additional information could be brought to the user. Besides producing additional information for the user, the different information sources can benefit from one another. We thus show how to effectively make use of text review and rating information for building improved rating predictors and review summaries. As already mentioned, the sentiment classifiers also benefit from the two information sources. This part of the work demonstrates that multiple information sources can be useful for improving recommendation systems. This is particularly interesting since several such sources are now effectively available at many online sites. Several new applications could be developed along the lines suggested here. From a modeling point of view, more sophisticated approaches can be developed. We are currently working on a multitask framework where the representations used in the different components are more closely correlated than in the present model.</p>
      <p>Acknowledgements The authors would like to thank the AMMICO project (F1302017 Q - FUI AAP 13) for funding our research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>BC</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B</given-names>
            <surname>Pang</surname>
          </string-name>
          .
          <article-title>Personalized recommendation of user comments via factor models</article-title>
          .
          <source>EMNLP'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M</given-names>
            <surname>Amini</surname>
          </string-name>
          and
          <string-name>
            <given-names>N</given-names>
            <surname>Usunier</surname>
          </string-name>
          .
          <article-title>A contextual query expansion approach by term clustering for robust text summarization</article-title>
          .
          <source>DUC'07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R</given-names>
            <surname>Burke</surname>
          </string-name>
          .
          <article-title>Hybrid recommender systems: Survey and experiments</article-title>
          .
          <source>UMUAI'02</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R-E</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K-W</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C-J</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X-R</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C-J</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>JMLR'08</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G</given-names>
            <surname>Ganu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N</given-names>
            <surname>Elhadad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A</given-names>
            <surname>Marian</surname>
          </string-name>
          .
          <article-title>Beyond the Stars: Improving Rating Predictions using Review Text Content</article-title>
          .
          <source>WebDB'09</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Bordes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Domain adaptation for large-scale sentiment classification: A deep learning approach</article-title>
          .
          <source>In ICML'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Minqing</given-names>
            <surname>Hu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Mining and summarizing customer reviews</article-title>
          .
          <source>KDD '04, page 168</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N</given-names>
            <surname>Jindal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Opinion spam and analysis</article-title>
          .
          <source>In WSDM</source>
          , pages
          <fpage>219</fpage>
          -
          <lpage>230</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Bell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          .
          <source>Computer</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Chin-Yew</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          .
          <source>In ACL Workshop: Text Summarization Branches Out</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Hidden factors and hidden topics: understanding rating dimensions with review text</article-title>
          .
          <source>RecSys'13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>JJ</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews</article-title>
          .
          <source>WWW'13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Matthew R</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jonathan L</given-names>
            <surname>Herlocker</surname>
          </string-name>
          .
          <article-title>A Collaborative Filtering Algorithm and Evaluation Metric That Accurately Model the User Experience</article-title>
          .
          <source>In SIGIR'04</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Wondra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H</given-names>
            <surname>Su</surname>
          </string-name>
          , and
          <string-name>
            <given-names>CX</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Topic sentiment mixture: modeling facets and opinions in weblogs</article-title>
          .
          <source>In WWW. ACM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B</given-names>
            <surname>Pang</surname>
          </string-name>
          and
          <string-name>
            <given-names>L</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B</given-names>
            <surname>Huval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>CD</given-names>
            <surname>Manning</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Semantic compositionality through recursive matrix-vector spaces</article-title>
          .
          <source>In EMNLP'12. ACL</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B</given-names>
            <surname>Pang</surname>
          </string-name>
          .
          <article-title>To each his own: personalized content selection based on text comprehensibility</article-title>
          .
          <source>In ICWDM'12. ACM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>User-level sentiment analysis incorporating social networks</article-title>
          .
          <source>In KDD'11. ACM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Explicit factor models for explainable recommendation based on phrase-level sentiment analysis</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>