=Paper= {{Paper |id=Vol-1448/paper7 |storemode=property |title=Extended Recommendation Framework: Generating the Text of a User Review as a Personalized Summary |pdfUrl=https://ceur-ws.org/Vol-1448/paper7.pdf |volume=Vol-1448 |dblpUrl=https://dblp.org/rec/conf/recsys/PoussevinGG15 }} ==Extended Recommendation Framework: Generating the Text of a User Review as a Personalized Summary== https://ceur-ws.org/Vol-1448/paper7.pdf

Extended Recommendation Framework: Generating the
Text of a User Review as a Personalized Summary

Mickaël Poussevin Vincent Guigue Patrick Gallinari
Sorbonne-Universités UPMC Sorbonne-Universités UPMC Sorbonne-Universités UPMC
LIP6 UMR 7606 CNRS LIP6 UMR 7606 CNRS LIP6 UMR 7606 CNRS
4 Place Jussieu, Paris, France 4 Place Jussieu, Paris, France 4 Place Jussieu, Paris, France
mickael.poussevin@lip6.fr vincent.guigue@lip6.fr patrick.gallinari@lip6.fr
Inputs Legend
ABSTRACT
User Rating Item Text
We propose to augment rating based recommender systems User reviews:
by providing the user with additional information which text and rating
might help him in his choice or in the understanding of the
Classic recommender systems
recommendation. We consider here as a new task, the gen- User text profiles Item text profiles
eration of personalized reviews associated to items. We use U Items Latent profiles
an extractive summary formulation for generating these re- s
views. We also show that the two information sources, rat- e = x
r
ings and items could be used both for estimating ratings and s Improving
for generating summaries, leading to improved performance rating predictions
for each system compared to the use of a single source. Be-
sides these two contributions, we show how a personalized Personnalized
polarity classifier can integrate the rating and textual as- ? reviews
?
Rating prediction summary
pects. Overall, the proposed system offers the user three per-
sonalized hints for a recommendation: rating, text and po-
larity. We evaluate these three components on two datasets Figure 1: Our contribution is twofold: (1) improving rating
using appropriate measures for each task. predictions using textual information, (2) generating per-
sonalized reviews summaries to push recommender systems
beyond rating predictions
1. INTRODUCTION
The emergence of the participative web has enabled users
to easily give their sentiments on many different topics. This
opinionated data flow thus grows rapidly and offers oppor- diction, some authors have make use of additional informa-
tunities for several applications like e-reputation manage- tion sources available on typical e-commerce sites. [5] pro-
ment or recommendation. Today many e-commerce web- posed to extract topics from consumer reviews in order to
sites present each item available on their platform with a improve ratings predictions. Recently, [11] proposed to learn
description of its characteristics, average appreciation, rat- a latent space common to both textual reviews and product
ings together with individual user reviews explaining their ratings, they showed that rating prediction was improved
ratings. by such hybrid recommender systems. Concerning the in-
Our focus here is on user - item recommendation. This is a formation provided to the user, some models exploit review
multifaceted task where different information sources about texts for ranking comments that users may like [1] or for
users and items could be considered and different recommen- answering specific user queries [17].
dation information could be provided to the user. Despite We start here from the perspective of predicting user pref-
this diversity, the academic literature on recommender sys- erence and argue that the exploitation of the information
tems has focused only on a few specific tasks. The most present in many e-commerce sites, allows us to go beyond
popular one is certainly the prediction of user preferences simple rating prediction for presenting users with comple-
given their past rating profile. These systems typically rely mentary information that may help him making his choice.
on collaborative filtering [9] to predict missing values in a We consider as an example the generation of a personalized
user/item/rating matrix. In this perspective of rating pre- review accompanying each item recommendation. Such a
review is a source of complementary evidence for the user
appreciation of a suggestion. Similarly as it is done for the
ratings, we exploit past information and user similarity in
order to generate these reviews. Since pure text generation
is a very challenging task [2], we adopt an extractive sum-
mary perspective: the generated text accompanying each
rating will be extracted from the reviews of selected users
who share similar tastes and appreciations with the target
CBRecSys 2015, September 20, 2015, Vienna, Austria. user. Ratings and reviews being correlated, this aspect could
Copyright remains with the authors and/or original copyright holders. also be exploited to improve the predictions. Our rating pre-
dictor will make use of user textual profiles extracted from κui sentences: dui = {suik , 1 ≤ k ≤ κui }. In this work,
their reviews and summary extraction in turn will use pre- we consider documents as bags of sentences. To simplify
dicted ratings. Thus both types of information, predicted notations, suik is replaced by sui when there is no ambigu-
ratings and generated text reviews, are offered to the user ity. Thus, user appreciations are quadruplets (u, i, rui , dui ).
and each prediction, rating and generated text, takes into Recommender systems use past information to compute a
account the two sources of information. Additional infor- rating prediction r̂ui , the corresponding prediction function
mation could also be provided to the user. We show here is denoted f (u, i).
as an example, that predicted ratings and review texts can For the experiments, ratings and text reviews are split
be used to train a robust sentiment classifier which provides into training, validation and test sets respectively denoted
the user with a personalized polarity indication about the Strain , Sval and Stest and containing mtrain , mval and mtest
(u)
item. The modules of our system are evaluated on the two user appreciations (text and rating). We denote Strain , the
main tasks, rating prediction and summary extraction, and subset of all reviews Strain that were written by user u and
on the secondary task of sentiment prediction. For this, (u) (i)
mtrain the number of such reviews. Similarly, Strain and
experiments are conducted on real datasets collected from (i)
mtrain are used for the reviews on item i.
amazon.com and ratebeer.com and models are compared to
classical baselines. 2.2 Hybrid recommender system with text pro-
The recommender system is compared to a classic collab-
orative filtering model using the mean squared error metric.
files
We show that using both ratings and user textual profiles Recommender systems classically use rating history to
allows us to improve the performance of a baseline recom- predict the rating r̂ui that user u will give to item i. The
mender. Gains are motivated from a more precise under- hybrid system described here makes use of both collaborative
standing of the key aspects and opinions included in the filtering through matrix factorization and textual informa-
item and user textual profiles. For evaluating summary text tion to produce a rating as described in (1):
generation associated to a couple (user, item), we have at
our disposal a gold standard, the very review text written
f (u, i) = µ + µu + µi + γu .γi + g(u, i) (1)
by this user on the item. Note that this is a rare situation
in summary evaluation. However contrarily to collaborative The first three predictors in equation (1) are biases (over-
filtering, there is no consensual baseline. We then compare all bias, user bias and item bias). The fourth predictor is
our results to a random model and to oracle optimizing the a classical matrix factorization term. The novelty of our
ROUGE-n metric. They respectively provide a lower and model comes from the fifth term (1) that takes into account
an upper bound of the attainable performance. The sen- text profiles to refine the prediction f . Our aim for the
timent classifier is classically evaluated using classification rating prediction is to minimize the following empirical loss
accuracy. function:
This article is organized as follows. The hybrid formu-
1 X
lation, the review generator and the sentiment classifier are argmin L = (rui − f (u, i))2 (2)
presented in section 2. Then, section 3 gives an extensive ex- µ,µu ,µi ,γu ,γi ,g mtrain S
train
perimental evaluation of the framework. The overall gains
associated to hybrid models are discussed in section 4. A To simplify the learning procedure, we first optimize the pa-
review of related work is provided in section 5. rameters of the different components independently as de-
scribed in the following subsections. Then we fine tune the
combination of these components by learning weighting co-
2. MODELS efficients so as to maximize the performance criterion (2) on
In this section, after introducing the notations used through- the validation set.
out the paper, we will describe successively the three mod-
ules of our system. We start by considering the prediction 2.2.1 Matrix factorization
of ratings [11]. Rating predictors answer the following ques- We first compute the different bias from eq. (1) as the
tion: what rating will this user give to this item? We present averaged ratings over their respective domains (overall, user
a simple and efficient way to introduce text profiles repre- and item). For the matrix factorization term, we approxi-
senting the writing style and taste of the user in a hybrid mate the rating matrix RU I using two latent factors: RU I ≈
formulation. We then show how to exploit reviews and rat- ΓU ΓTI . Both ΓU and ΓI are two matrices representing collec-
ings in a new challenging task: what text will this user write tions of latent profiles, with one profile per row. We denote
about this item? We propose an extractive summary formu- γu (resp. γi ) the row of ΓU (resp. ΓI ) corresponding to the
lation of this task. We then proceed to describe how both latent profile of user u (resp. item i).
ratings and text could be used together in a personalized The profiles are learned by minimizing, on the training set,
sentiment classifier. the mean squared error between known ratings in matrix
RU I and the approximation provided by the factorization
2.1 Notations ΓU ΓTI . This minimization problem described in equation
We use u (respectively i) to refer to everything related to (3), with an additional L2 constraint (4) on the factors is
a user (respectively to an item) and the rating given by user solved here using non-negative matrix factorization.
u to the item i is denoted rui . U and I refer to anything
relative to all users and all items, such as the rating matrix
RU I . Similarly, lower case letters are used for scalars or vec- Γ∗U , Γ∗I = argmin kMtrain (RU I − ΓU ΓI )k2F (3)
ΓU ,ΓI
tors and upper case letters for matrices. dui is the actual
review text written by user u for item i. It is composed of +λU kΓU k2F + λI kΓI k2F (4)
In this equation Mtrain is a binary mask that has the computational load. All sentences are represented as binary
same dimensions as matrix RU I , an entry is 1 only if the bag of words using this dictionary. The coding dimension
corresponding review is in the training set, is the element- has been set to 1000 after a few evaluation trials. Note that
wise product and k · kF denotes the Frobenius norm. the precise value of this latent space is not important and the
performance is similar on a large range of dimension values.
2.2.2 Text profiles exploitation 1
Both cod and dec use sigmoid units sig(t) = 1+exp(−t) :
Let us denote πu the text profile of user u and σt (πu0 , πu )
a similarity operator between user profiles. The last compo- cod(suik ) = πuik = sig(W suik + b)
(8)
nent of the predictor f in (1) is a weighted average of user dec(πuik ) = sig(W T πuik + b0 )
ratings for item i, where weight σt (πu0 , πu ) is the similarity Here, πuik is a vector, W is a 5000x1000 weight matrix
between the text profiles πu0 and πu of users u0 and u, the and sig() is a pointwise sigmoid operator operating on the
latter being the target user. This term takes into account vector W suik + b.
the fact that two users with similar styles or using similar ex- As motivated in [11, 5], such a latent representation helps
pressions in their appreciation of an item, should share close exploiting term co-occurrences and thus introduces some se-
ratings on this item. The prediction term for the user/item mantic. It provides a robust text representation. The hidden
couple (u, i) is then expressed as a weighted mean: activity of this neural network produces a continuous rep-
resentation for each sentence accounting for the presence or
1 X
absence of groups of words.
g(u, i) = (i)
ru0 i σt (πu0 , πu ) (5)
mtrain (i) πu is obtained by coding the vector corresponding to all
S train text written by the user u in the past. It lies in a latent
Two different representations for the text profiles πu of word space where a low Euclidean distance between users
the users are investigated in this article: one is based on means a similar usage of words. Thus, for the similarity σt ,
a latent representation of the texts obtained by a neural we use an inverse Euclidean distance in the latent space:
network autoencoder, the other relies on a robust bag of σt (πu0 , πu ) = 1/(α + kπu0 − πu k) (9)
words coding. Each one is associated to a dedicated metric
σt . 2.2.3 Global training criterion for ratings prediction
This leads to two formulations of g, and thus, to two rating In order to connect all the elementary components de-
prediction models. We denote the former fA (autoencoder) scribed above with respect to our recommendation task, we
and the latter fT (bag of words). Details are provided below. introduce (positive) weighting parameters β in (1). Thus,
the initial optimization problem (2) becomes:
Bag of words.
β ∗ = argmin mtrain
1
P
A preprocessing step removes all words appearing in less Strain
β
than 10 documents. Then, the 100 000 most frequent words 2 (10)
are kept. Although the number of features is large, the rep- rui − β1 µ∗ + β2 µ∗u + β3 µ∗i + β4 γu∗ .γi∗ + β5 g(u, i)
resentation is sparse and scales well. πu is simply the binary
bag of words of all texts of user u. In this high dimensional The linear combination is optimized using a validation set:
space, the proximity in style between two users is well de- this step guaranties that all components are combined in an
scribed by a cosine function, a high value indicates similar optimal manner.
usage of words: 2.3 Text generation model
σt (πu0 , πu ) = πu0 πu /(kπu0 kkπu k) (6) The goal here is to generate a review text for each (u,i)
recommendation. During the recommendation process, this
text is an additional information for users to consider. It
Autoencoder. should catch their interest and in principle be close to the
The neural network autoencoder has two components: a one that user u could have written himself on item i. Each
coding operator and a decoding operator denoted respec- text is generated as an extractive summary, where the ex-
tively cod and dec. The two vectorial operators are learned tracted sentences su0 i come from the reviews written by
so as to enable the reconstruction of the original text after other users (u0 6= u) about item i. Sentence selection is
a projection in the latent space. Namely, given a sentence performed according to a criterion which combines a simi-
suik represented as a binary bag of words vector, we obtain larity between the sentence and the textual user profile and
a latent profile πsuik = cod(suik ) and then, we reconstruct a similarity between the actual rating ru0 i and the predic-
an approximation of the sentence using ŝuik = dec(πsuik ). tion made for (u,i), r̂ui computed as described in section 2.2.
The autoencoder is optimized so as to minimize the re- The former measure could take into account several dimen-
construction error over the training set: sions like vocabulary, sentiment expression and even style,
κui here it is mainly the vocabulary which is exploited. The
X 1 X
cod∗ , dec∗ = argmin ksuik − dec(cod(suik ))k2 latter measures the proximity between user tastes. For the
cod,dec
Strain
κui text measure, we make use of the σt similarity introduced
k=1
(7) in section 2.2. As before, we will consider two representa-
We use the settings proposed in [6]: our dictionary is ob- tions for texts (latent coding and raw bag of words). For the
tained after stopwords removal and selecting the most fre- ratings similarity, we use σr (ru0 i , rui ) = 1/(1 + |ru0 i − rui |).
quent 5000 words. we did not use a larger dictionary such as Suppose one wants to select a single sentence for the ex-
the one used for the bag of word representation since it does tracted summary. The sentence selection criterion will then
not lead to improved performance and simply increases the be a simple average of the two similarities:
Source Subset names #Users #Items #Reviews
#Training #Validation #Test
RB U50 I200 52 200 7200 900 906

r
σt (su0 i , πu ) + σr (ru0 i , r̂ui )

ee
h(su0 i , ru0 i , u0 , u, i) = RB U500 I2k 520 2000 388200 48525 48533

eb
(11)

at
2 RB U5k I20k 5200 20000 1887608 235951 235960

R
A U200 I120 213 122 984 123 130

on
Note that this function may score any piece of text. In A U2k I1k 2135 1225 31528 3941 3946

az
A U20k I12k 21353 12253 334256 41782 41791

m
the following, we then consider three possibilities for gener-

A
A U210k I120k 213536 122538 1580576 197572 197574
ating text reviews: The first one simply consists in selecting
the best sentence su0 i among all the training sentences for Table 1: Users, items & reviews counts for every datasets.
item i with respect to h. We call it 1S for single sentence. Subsets µ µu µi γu .γi fA fT
The second one selects a whole review du0 i among all the RB U50 I200 0.7476 0.7291 0.3096 0.2832 0.2772 0.2773
RB U500 I2k 0.6536 0.6074 0.3359 0.3168 0.3051 0.3051
reviews for i. The document is here considered as one long RB U5k I20k 0.7559 0.6640 0.3912 0.3555 0.3451 0.3451
sentence. This is denoted CT for complete text. The third A U200 I120 1.5348 2.0523 1.6563 1.7081 1.4665 1.4745
one is a greedy procedure that selects multiple sentences, it A U2k I1k 1.5316 1.4391 1.3116 1.0927 1.0483 1.0485
is denoted XS. It is initialized with 1S, and then sentences A U20k I12k 1.4711 1.4241 1.2849 1.0797 1.0426 1.0426
A U210k I120k 1.5072 2.1154 1.5318 1.2915 1.1671 1.1678
are selected under two criteria: relevance with respect to h
and diversity with respect to the sentences already selected.
Table 2: Test performance (mean squared error) for rec-
Selection is stopped when the length of the text is greater
ommendation. µ, µu , µi are the overall bias, user bias and
than the average length of the texts of the target user. Al-
item bias baselines. γu .γi is the plain matrix factorization
gorithm 1 sums up the XS procedure for generating the text
baseline. fA , fT are our hybrid recommender systems rely-
dˆui for the couple user u, item i.
ing respectively on latent and raw text representations. The
different datasets are described in table 1.
Data: u, i, S = {(su0 i , ru0 i u0 }
Result: dˆui
s∗u0 i ← argmax h(su0 i , ru0 i , u0 , u, i) ;

lem:
su0 i ∈S
X
dˆui ← s∗u0 i ; w∗ = argmin 1− dui .w+f (u, i) cui +λkwk2
+
Remove s∗u0 i from S; w
Strain ,rui 6=3

while length dˆui < averagelength(u) do (12)
s∗u0 i ← argmax h(su0 i , ru0 i , u0 , u, i) − cos(su0 i , dˆui ) ;
with (x)+ = x when x positive and (x)+ = 0 elsewhere. In
su0 i ∈S the experimental section, we will also compare the results
dˆui ← s∗u0 i ; obtained with the two versions of our rating predictor: fT
Remove s∗u0 i from S; and fA (cf section 2.2.2).
end
Algorithm 1: XS greedy procedure: selection of successive 3. EXPERIMENTS
sentences to maximize both relevance and diversity. dˆui is
All three modules, ratings, text, sentiments, are evaluated
the text that is generated, sentence after sentence.
independently since there is no global evaluation framework.
These individual performances should however provide to-
gether a quantitative appreciation of the whole system.
2.4 Sentiment prediction model We use two real world datasets of user reviews, collected
We show here how polarity information about an item can from amazon.com [8] and ratebeer.com [11]. Their charac-
be estimated by exploiting both the user predicted ratings teristics are presented in table 1.
and his textual profile. Exploiting both information sources Below, one presents first how datasets are preprocessed
improves the sentiment prediction performance compared in 3.1. The benefits of incorporating the text in the ratings
with a usual text based sentiment classifier. prediction for the recommender system are then discussed in
Polarity classification is the task of predicting whether a section 3.2. The quality of the generated reviews is evaluated
text dui (here of a review) is positive or negative. We use as and analyzed in section 3.3 Finally, the performance of the
ground truth the ratings rui and follow a standard thresh- sentiment classifier combining text and ratings is described
olding procedure [15]: reviews rated 1 or 2 are considered in 3.4.
as negative, while items rated 4 or 5 are positive. All texts
that are rated 3 are ignored as it is unclear whether that 3.1 Data preprocessing
are positive or negative: it strongly depends on the rating Reviews from different websites have different formats (rat-
habits of the user. ing scales, multiple ratings, . . . ). We focus on the global
For evaluation purpose, we consider two baselines. A first rating and scaled it to a 1 to 5 integer range. For titled
one only uses the rating prediction of our recommender sys- reviews, the title is considered as the first sentence of the
tem f (u, i) as a label prediction, this value is then thresh- text of the review. Each dataset is randomly split into three
olded as indicated above. A second one is a classical text parts: training, validation and test containing respectively
sentiment classifier. Denoting by dui the binary bag of word 80%, 10% and 10% of the reviews.
representation of a document and cui the binary label associ- As described in 2.2, two representations of the text are
ated to the rating rui , one uses a linear SVM s(dui ) = dui .w. considered each with a different dictionary:
Note that this is usually a strong baseline for the polarity
classification task. Our final classifier will combine f (u, i) • for the autoencoder, we have selected the 5000 most
and s(dui ) in order to solve the following optimization prob- frequent words, with a stopwords removal step; The
autoencoder input vector is then a binary vector of step: the inherent sparsity of the bag of word representation
dimension 5000. enables fT to provide faster computations than fA . The au-
toencoder works in a smaller dimensional space but it is not
• for the raw representation, we have selected the 100000 sparse.
most frequent words appearing in more than 10 docu-
ments (including stopwords) and used a binary vector
representation.
3.3 Text generation evaluation
We move on now to the evaluation of the personalized
For the experiments, we consider several subsets of the review text generation module. Since we are using an ex-
databases with different numbers of users and items. Each tractive summary procedure, we make use of a classical loss
dataset is built by extracting, for a given number of users used for summarization systems: we use a recall-oriented
and items, the most active users and the most commented ROUGE-n metrics, by comparing the generated text against
items. Dataset characteristics are given in table 1. the actual text of the review produced by the user. As far as
we know, generating candidate reviews has never been dealt
Subsets LL µi γu .γi fA fT LL + fA LL + fT with in this context and this is a novel task. The ROUGE-n
RB U50 I200 5.35 5.12 6.01 5.57 5.57 3.79 3.79
RB U500 I2k 7.18 10.67 9.73 8.55 8.55 6.52 6.92 metric is the proportion of n-grams of the actual text found
RB U5k I20k 8.44 11.80 10.04 9.17 9.17 8.33 8.35 in the predicted (candidate) text, we use n = {1, 2, 3}. The
A U200 I120 10.00 15.83 22.50 20.00 20.83 10.00 10.00 higher ROUGE-n is, the better the quality of the candidate
A U2k I1k 7.89 15.25 12.85 12.62 12.62 7.54 7.54
A U20k I12k 6.34 13.99 12.79 12.38 12.37 6.29 6.29 text is. A good ROUGE-1 means that topics or vocabu-
A U210k I120k 6.25 14.04 14.40 13.32 13.31 6.22 6.22 lary are correctly caught while ROUGE-2 and ROUGE-3
are more representative of the user’s style.
Table 3: Test performance (classification error) as polarity A first baseline is given by using a random scoring function
classifiers. LL stands for LibLinear (SVM), µi , γu .γi , fA , h (instead of the formulation given in (11)). It provides a
fT are the recommender systems as in table 2. LL + fA lower bound of the performance. Three oracles are then
and LL + fT are two hybrid opinion classification models used to provide an upper bound on the performance. They
combining the SVM classifier and fA and fT recommender directly optimize the metrics ROUGE-n from the data on
systems. the test set.A matrix factorization baseline is also used. It is
a special case of our model where no text information is used.
This model computes a similar score for all the sentences of
3.2 Recommender system evaluation a given user and relative to an item. When one sentence
Let us first consider the evaluation of the rating predic- only is selected, it is taken at random among the sentences
tion. The metric used here is the mean squared error (MSE) of this user for the item. With greedy selection, the first
between rating predictions r̂ui and actual ratings rui . The sentence is chosen at random and then the cosine diversity
lower the MSE is, the better the model is able to estimate term (algorithm 1) allows a ranking of the next candidate
the correspondence between user tastes and items. Results sentences. Our proposed method is evaluated with the two
are presented in table 2. different user profile πu representation (auto-encoder and
The models are referenced using the notations introduced raw text). The performance of these seven models on the
in section 2.2. The first column corresponds to a trivial the two biggest datasets with respect to the three metrics
system which predicts µ the overall bias, the second predicts are aggregated in figure 2.
the user bias µu . Both give poor performance as expected. An histogram corresponds to a text selection entity (whole
The third column corresponds to the item bias µi base- review text, best single sentence, greedy sentence selection.
line. It assumes that user taste is not relevant and that each Groups in the histograms (respectively row block of the ta-
item has its own intrinsic quality. The improvement with bles) are composed of three cells corresponding respectively
respect to µ and µu is important since MSE is halved. The to the ROUGE-1, -2, -3 metrics. Not surprisingly, the results
fourth column corresponds to a nonnegative matrix factor- for the single sentence selection procedure (1S) are always
ization baseline, denoted γu .γi . It jointly computes latent worse than for the other two (CT: complete review and XS:
representations for user tastes and items characteristics. Un- multiple sentences). This is simply because a sentence con-
surprisingly, it is our best baseline. tains fewer words than a full review and it can hardly share
It could be noted that performance tends to degrade when more n-grams than the full text with the reference text. For
the subset size increases. This is a side effect associated to the ratebeer.com datasets, selecting a set of sentences clearly
the review selection process used for building the different offers a better performance than selecting a whole review in
datasets. Smaller datasets contain the most active users and all cases. Texts written to describe beers also describe the
the most commented items. The estimation of their profiles tasting experience. Was it in a bar or at home ? Was it a
benefits from the high number of reviews per user (and item) bottle or on tap ? Texts of the community share the same
in this context. structure and vocabulary to describe both the tasting and
The last two columns refer to our hybrid recommender the flavors of the beer. Most users write short and precise
systems, using the two text representations introduced in sentences. This is an appropriate context for our sentence
section 2.2. Both fA (autoencoder) and fT (raw text) per- scoring model, where the habits of users are caught by our
form better than a baseline collaborative filtering system recommender systems. The performance slightly decreases
and both have similar approximation errors. The main dif- when the size of the dataset is increased. As before, this is
ference between the systems comes from the complexity of in accordance with the selection procedure of these datasets
the approach: during the learning step, fT is much faster which focuses first on the most active users and commented
than fA given the fact that no autoencoder optimization is items. For Amazon, the conclusion is not so clear and de-
required. On top of that, fT remains faster in the inference pending on the conditions, either whole reviews or selected
0.6
CT 1S XS 0.6 CT 1S XS 0.10
0.15

0.5 0.5
0.08

0.4 0.4
0.10

A_U210k_I120k
0.06
RB_U5k_I20k

0.3 0.3

0.04
0.2 0.05 0.2

0.02
0.1 0.1

0.0 0.00 0.0 0.00
RO ndom

RO ndom

RO ndom
UG 2

UG 2

UG 2
ROUGE-1

ROUGE-1

ROUGE-1
NEM-3

NEM-3

NEM-3
F

F
f_A

f_A

f_A
f_T

f_T

f_T
ROUGE-

ROUGE-

ROUGE-
Ra

Ra
ROUGE-1 ROUGE-2 ROUGE-3 ROUGE-1 ROUGE-2 ROUGE-3

(a) RateBeer experiments (b) Amazon experiments

Figure 2: Histograms of the performances of the summarizer on the two biggest datasets. The scores of the ROUGE-1
metrics are represented in blue while the scores of the ROUGE-2 and ROUGE-3 metrics are in yellow and black. 7 models
are compared: random, 3 oracles, NMF based model, fA and fT based models. 3 frameworks are investigated: CT (review
extraction), 1S (One sentence extraction), XS (Multiple sentence extraction).

sentences get the best score. It is linked to the higher variety problem formulation and the context of short product re-
in the community of users on the website: well structured views, ROUGE-2,3 are clearly too constraining and the cor-
sentences like those present in RateBeer are here mixed here responding scores are not significant.
with different levels of English and troll reviews.
The different models, overall, are following a clear hierar- 3.4 Sentiment classification evaluation
chy. First, stating the obvious, the random model has the The performance of the different models, using the sen-
worst performance. Then, using a recommender system to timent classification error as an evaluation metric, are pre-
select relevant sentences helps in terms of ROUGE-n per- sented in table 3. Because they give very poor performance,
formance. Using the text information brings most of the the bias recommendation models (µ and µu ) are not pre-
time only a small score improvement. Overall our models sented here. The item bias µi , second column, gives a base-
only offer small improvements here with respect to random line, which is improved by the matrix factorization γu .γi ,
or NMF text selection (i.e. based on rating similarity only). third column. Our hybrid models fA , fourth column, and
After analyzing this behavior, we believe that this is due to fT , fifth column, have lower classification errors than all
the shortness of the text reviews, to their relatively stan- the other recommender systems. The first column, LL is
dardized form (arguments are very similar from one review the linear support vector machine (SVM) baseline. It has
to another), to the peaked vocabulary distribution of the re- been learnt on the training set texts, and the regularization
views, and to the nature of ROUGE. The latter is a classical hyperparameter has been selected using the validation set.
recall oriented summarization evaluation measure, but does Our implementation relies on liblinear (LL) [4].
not distinguishes well between text candidates in this con- Its performance is better than the recommender systems
text. This also shows that there is room for improvement but it should be noted that it makes use of the actual text
on this aspect. dui of the review, whereas the recommender systems only
Concerning the oracle several conclusions can be drawn. use past information regarding user u and item i. Note
For both single sentence and complete text selection, the gap that even in this context, the recommender performance on
between the ROUGE measures and the proposed selection RateBeer is very close to the SVM baseline.
method is important suggesting that there is still room for It is then possible to combine the two models, according
improvements here too. For the greedy sentence selection, to the formulation proposed in section 2.4. The resulting
the gap between the oracles and the hybrid recommender hybrid approaches, denoted LL + fA and LL + fT , exploit
systems is moderate suggesting that the procedure is here both text based decision (SVM) and user profile (fA and fT ).
fully efficient. However this conclusion should be moderated. This combined model shows good classification performance
It can be observed that whereas, ROUGE is effectively an and overcomes the LL baseline in 4 out of 7 experiments
upper bound for single sentence or whole review selection, in table 3, while performing similarly to LL in the other 3
this is no more the case for multiple sentences selection. experiments.
Because of the complexity of selecting the best subset of
sentences according to a loss criterion (which amounts at a
combinatorial selection problem) we have been using a sub- 4. OVERALL GAINS
optimal forward selection procedure: we first select the best In order to get a global vision of the overall gain provided
ROUGE sentence, then the second best, etc. In this case by the proposed approach, we summarize here the results
the ROUGE procedure is no more optimal. obtained on the different tasks. For each task, the gain with
Concerning the measures, the performance decreases rapidly respect to the (task dependent) baseline is computed and
when we move from ROUGE-1 to ROUGE-2, 3. Given the averaged (per task) over all datasets. The metric depends
on the task. Results are presented in figure 3.
10 70 10
Gain in % w.r.t. random on rouge-n
0 60 0

10 10
50
20 20
40
30 30
30
40 40
20
50 50

60 10 60
Gain in % w.r.t. MSE of γu .γi Gain in % w.r.t. % Good classification of LL
70 0 70
µ µu µi fA fT γu .γi fA fT rouge-1 rouge-2 rouge-3 µi γu .γi fA fT LL + fA LL + fT

(a) Recommender systems. (b) Summarizers. (c) Opinion classifiers.
Baseline=matrix factorization Baseline=random selection procedure Baseline=SVM

Figure 3: Aggregated gains on the 3 tasks w.r.t. classic baselines: our hybrid recommender systems are better overall.

For the mean squared error metric (figure 3a) the matrix laborative filtering. Given the focus of this work on con-
factorization is used as baseline. The user bias µu heavily sumer reviews, we considered collaborative filtering. For
fails to generalize on two datasets. The item bias is closer merchant websites the goal is to encourage users to buy new
to the baseline (−11.43%). Our hybrid models, which uses products and the problem is usually considered either as the
texts to refine user and item profiles bring a gain of 5.71% prediction of a ranked list of relevant items for each user [13]
for fA , 5.63% for fT . This demonstrates the interest of or as the completion of missing ratings [9]. We have focused
including textual information in the recommender system. here on the latter approach for evaluation concerns: since
Autoencoder and raw text approaches offer similar gains, we use data collected from third party sources.
the latter approach being overall faster.
For the text generation, we take the random model as 5.2 Text summarization for consumer reviews
baseline and results are presented in figure 3b. The gain is Early reference work [7] on consumer reviews has focused
computed for the three investigated framework (CT: review on global summarization of user reviews for each item. The
selection, 1S: one sentence selection, XS: multiple sentence motivation of this work was to extract the sentiments associ-
selection) and per measure (ROUGE-1, 2, 3) and then av- ated to a list of features from all the item review texts. The
eraged to one overall gain. ROUGE-n oracles clearly out- summarization took the form of a rating or of an apprecia-
perform other models, which seems intuitive. The differ- tion of each feature. Here, contrarily to this line of work, the
ent recommender systems have very close behaviors with focus is on personalized item summaries for a target user.
respective gains of 11.15% (matrix factorization), 11.89% Given the difficulty of producing a comprehensive synthetic
(auto-encoder), 11.83% (raw text). Here textual informa- summary, we have turned this problem into a sentence or
tion helps but does not clearly dominate ratings and pro- text selection process.
vide only a small improvement. Remember that although Evaluation of summaries is challenging: how to assess the
performance improvement with respect to baselines is desir- quality of a summary when the ground truth is subjective?
able, the main novelty of the approach here is to propose In our context, the review texts are available and we used
a personalized summary generation together with the usual them as the ground truth. We have used classical ROUGE-n
rating prediction. summary evaluation measures [10].
For the opinion classifier, presented in figure 3c, the base-
line consists in a linear SVM. Basic recommender systems 5.3 Sentiment classification
perform poorly with respect to the baseline (LL). Surpris- Different text latent representations have been proposed
ingly, the item bias µi (−68.71%) performs slightly better in this scope: [14] proposed a generative model to represent
than matrix factorization γu .γi (−69.54%) in the context of jointly topic and sentiments and recently, several works have
sentiment classification (no neutral reviews and binary rat- considered matrix factorization or neural network, in an at-
ings). Using textual information increases the performance. tempts to develop robust sentiment recognition systems [6].
The autoencoder based model fA (−57.17%) and raw text [16] go further and propose to learn two types of represen-
approach fT (−58.31%) perform similarly. As discussed in tation: a vectorial model is learned for word representation
3.4, the linear SVM uses the text of the current reviews when together with a latent transformation model, which allows
the recommender systems does not. As a consequence, it is the representation of negation and quantifiers associated to
worth combining both predictions in order to exploit text an expression.
and past profiles: the resulting models give respective gains We have investigated two kinds of representation for the
of 4.72% (autoencoder) and 3.89% (raw text) w.r.t the SVM. texts: bag of words and a latent representation through the
use of autoencoders as in [6]. [11] also uses a latent represen-
5. RELATED WORK tation for representing reviews, although in a probabilistic
Since the paper covers the topics of rating prediction, sum- setting instead in a deterministic one like we are doing here.
marization and sentiment classification, we briefly present
each of them. 5.4 Hybrid approaches
In the field of recommendation, a first hybrid model was
5.1 Recommender systems proposed by [5]: it is based on hand labeling of review sen-
Three main families of recommendation algorithms have tences (topic and polarity) to identify relevant character-
been developed [3]: content-based knowledge-based, and col- istics of the items. [11] pushes further the exploitation of
texts, by using a joint latent representation for ratings and representations used in the different components are more
textual content with the objective of improving the rating closely correlated than in the present model.
accuracy. These two works are focused on rating prediction Acknoledgements The authors would like to thank the
and do not consider delivering additional information to the AMMICO project (F1302017 Q - FUI AAP 13) for funding
user. Very recently, [19] has considered adding an explana- our research.
tion component to a recommender system. For that, they
propose to extract some keywords from the review texts, 7. REFERENCES
which are supposed to explain why a user likes or dislikes
[1] D Agarwal, BC Chen, and B Pang. Personalized
an item. This is probably the work whose spirit is closest to
recommendation of user comments via factor models.
ours but they do not provide a quantitative evaluation.
EMNLP’11, 2011.
[7] combined opinion mining and text summarization on
product reviews with the goal of extracting the qualities and [2] M Amini and N Usunier. A contextual query
defaults. [17] proposed a system for delivering personalized expansion approach by term clustering for robust text
answers to user queries on specific products. They built the summarization. DUC’07, 2007.
user profiles relying on topic modeling without any senti- [3] R Burke. Hybrid recommender systems: Survey and
ment dimension. [1] proposed a personalized news recom- experiments. UMUAI’02, 2002.
mendation algorithm evaluated on the Yahoo portal using [4] R-E Fan, K-W Chang, C-J Hsieh, X-R Wang, and C-J
user feedback, but it does not investigate ratings or sum- Lin. Liblinear: A library for large linear classification.
marization issues. Overall, we propose in this article to go JMLR’08, 2008.
beyond a generic summary of item characteristics by gener- [5] G Ganu, N Elhadad, and A Marian. Beyond the Stars:
ating for each user a personalized summaries that is close to Improving Rating Predictions using Review Text
what they would have written about the item themselves. Content. WebDB’09, 2009.
For a long time, sentiment classification has ignored the [6] X Glorot, A Bordes, and Y Bengio. Domain
user dimension and has focused for example on the concep- adaptation for large-scale sentiment classification: A
tion of ”universal” sentiment classifiers able to deal with a deep learning approach. In ICML’11, 2011.
large variety of topics [15]. Considering the user has become [7] Minqing Hu and Bing Liu. Mining and summarizing
an issue only very recently. [18] for example exploited ex- customer reviews. KDD ’04, page 168, 2004.
plicit relations in social graphs for improving opinion clas- [8] N Jindal and B Liu. Opinion spam and analysis. In
sifiers, but their work is only focused on this aspect. [12] WSDM, pages 219–230. ACM, 2008.
proposed to distinguish different rating behaviors and show [9] Yehuda Koren, Robert Bell, and Chris Volinsky.
that modeling the review authors in a scale ranging from Matrix factorization techniques for recommender
connoisseur to expert offers a significant gain for an opinion systems. Computer, pages 42–49, 2009.
prediction task. [10] Chin-Yew Lin. Rouge: A package for automatic
In our work, we have experimented the benefits of con- evaluation of summaries. In ACL Workshop: Text
sidering the text of user reviews in recommender system for Summarization Branches Out, 2004.
their performance as sentiment classifier. We have addition-
[11] J McAuley and J Leskovec. Hidden factors and hidden
ally proposed, as a secondary contribution, an original model
topics: understanding rating dimensions with review
mixing recommender systems and linear classification.
text. RecSys’13, 2013.
[12] JJ McAuley and J Leskovec. From amateurs to
6. CONCLUSION connoisseurs: modeling the evolution of user expertise
This article proposes an extended framework to the rec- through online reviews. WWW’13, 2013.
ommendation task. The general goal is to enrich classical [13] Matthew R McLaughlin and Jonathan L Herlocker. A
recommender systems with several dimensions. As an ex- Collaborative Filtering Algorithm and Evaluation
ample we show how to generate personalized reviews for Metric That Accurately Model the User Experience.
each recommendation using extracted summaries. This is In SIGIR’04, 2004.
our main contribution. We also show how rating and text [14] Q Mei, X Ling, M Wondra, H Su, and CX Zhai. Topic
could be used to produce efficient personalized sentiment sentiment mixture: modeling facets and opinions in
classifiers for each recommendation. Depending on the ap- weblogs. In WWW. ACM, 2007.
plication, other additional information could be brought to [15] B Pang and L Lee. Opinion mining and sentiment
the user. Besides producing additional information for the analysis. Information Retrieval, 2008.
user, the different information sources can take benefit one
[16] R Socher, B Huval, CD Manning, and A Ng. Semantic
from the other. We thus show how to effectively make use
compositionality through recursive matrix-vector
of text review and rating informations for building improved
spaces. In EMNLP’12. ACL, 2012.
rating predictors and review summaries. As already men-
tioned, the sentiment classifiers also benefits from the two in- [17] C Tan, E Gabrilovich, and B Pang. To each his own:
formation sources. This part of the work demonstrates that personalized content selection based on text
multiple information sources could be useful for improving comprehensibility. In ICWDM’12. ACM, 2012.
recommendation systems. This is particularly interesting [18] C Tan, L Lee, J Tang, L Jiang, M Zhou, and P Li.
since several sources are effectively available now at many User-level sentiment analysis incorporating social
online sites. Several new applications could be developed networks. In KDD’11. ACM, 2011.
along the lines suggested here. From a modeling point of [19] Y Zhang, G Lai, M Zhang, Y Zhang, Y Liu, and S Ma.
view, more sophisticated approaches can be developed. We Explicit factor models for explainable recommendation
are currently working on a multitask framework where the based on phrase-level sentiment analysis. 2014.