Extended Recommendation Framework: Generating the Text of a User Review as a Personalized Summary

Mickaël Poussevin, Vincent Guigue, Patrick Gallinari
Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS
4 Place Jussieu, Paris, France
mickael.poussevin@lip6.fr, vincent.guigue@lip6.fr, patrick.gallinari@lip6.fr
ABSTRACT

We propose to augment rating based recommender systems by providing the user with additional information which might help them in their choice or in the understanding of the recommendation. We consider here, as a new task, the generation of personalized reviews associated to items. We use an extractive summary formulation for generating these reviews. We also show that the two information sources, ratings and review texts, can both be used for estimating ratings and for generating summaries, leading to improved performance for each system compared to the use of a single source. Besides these two contributions, we show how a personalized polarity classifier can integrate the rating and textual aspects. Overall, the proposed system offers the user three personalized hints for a recommendation: rating, text and polarity. We evaluate these three components on two datasets using appropriate measures for each task.

[Figure 1: Overview of the framework. Inputs: user reviews (text and rating). A classic recommender system factorizes the user x item rating matrix into latent profiles; user and item text profiles are added to improve rating predictions and to generate personalized review summaries. Our contribution is twofold: (1) improving rating predictions using textual information, (2) generating personalized review summaries to push recommender systems beyond rating predictions.]
1. INTRODUCTION

The emergence of the participative web has enabled users to easily give their sentiments on many different topics. This opinionated data flow thus grows rapidly and offers opportunities for several applications like e-reputation management or recommendation. Today many e-commerce websites present each item available on their platform with a description of its characteristics, its average appreciation and ratings, together with individual user reviews explaining their ratings.

Our focus here is on user-item recommendation. This is a multifaceted task where different information sources about users and items could be considered and different recommendation information could be provided to the user. Despite this diversity, the academic literature on recommender systems has focused only on a few specific tasks. The most popular one is certainly the prediction of user preferences given their past rating profile. These systems typically rely on collaborative filtering [9] to predict missing values in a user/item/rating matrix. In this perspective of rating prediction, some authors have made use of additional information sources available on typical e-commerce sites. [5] proposed to extract topics from consumer reviews in order to improve rating predictions. Recently, [11] proposed to learn a latent space common to both textual reviews and product ratings; they showed that rating prediction was improved by such hybrid recommender systems. Concerning the information provided to the user, some models exploit review texts for ranking comments that users may like [1] or for answering specific user queries [17].

We start here from the perspective of predicting user preference and argue that the exploitation of the information present in many e-commerce sites allows us to go beyond simple rating prediction and to present users with complementary information that may help them make their choice. We consider as an example the generation of a personalized review accompanying each item recommendation. Such a review is a source of complementary evidence for the user appreciation of a suggestion. As is done for the ratings, we exploit past information and user similarity in order to generate these reviews. Since pure text generation is a very challenging task [2], we adopt an extractive summary perspective: the generated text accompanying each rating will be extracted from the reviews of selected users who share similar tastes and appreciations with the target user. Ratings and reviews being correlated, this aspect can also be exploited to improve the predictions.
Our rating predictor will make use of user textual profiles extracted from their reviews, and summary extraction will in turn use predicted ratings. Thus both types of information, predicted ratings and generated text reviews, are offered to the user, and each prediction, rating and generated text, takes into account the two sources of information. Additional information could also be provided to the user. We show here, as an example, that predicted ratings and review texts can be used to train a robust sentiment classifier which provides the user with a personalized polarity indication about the item. The modules of our system are evaluated on the two main tasks, rating prediction and summary extraction, and on the secondary task of sentiment prediction. For this, experiments are conducted on real datasets collected from amazon.com and ratebeer.com, and models are compared to classical baselines.

The recommender system is compared to a classic collaborative filtering model using the mean squared error metric. We show that using both ratings and user textual profiles allows us to improve the performance of a baseline recommender. Gains stem from a more precise understanding of the key aspects and opinions included in the item and user textual profiles. For evaluating the summary text generation associated to a couple (user, item), we have at our disposal a gold standard: the very review text written by this user on the item. Note that this is a rare situation in summary evaluation. However, contrary to collaborative filtering, there is no consensual baseline. We then compare our results to a random model and to oracles optimizing the ROUGE-n metric. They respectively provide a lower and an upper bound of the attainable performance. The sentiment classifier is classically evaluated using classification accuracy.

This article is organized as follows. The hybrid formulation, the review generator and the sentiment classifier are presented in section 2. Then, section 3 gives an extensive experimental evaluation of the framework. The overall gains associated to hybrid models are discussed in section 4. A review of related work is provided in section 5.

2. MODELS

In this section, after introducing the notations used throughout the paper, we describe successively the three modules of our system. We start by considering the prediction of ratings [11]. Rating predictors answer the following question: what rating will this user give to this item? We present a simple and efficient way to introduce text profiles representing the writing style and taste of the user in a hybrid formulation. We then show how to exploit reviews and ratings in a new challenging task: what text will this user write about this item? We propose an extractive summary formulation of this task. We then proceed to describe how both ratings and text can be used together in a personalized sentiment classifier.

2.1 Notations

We use u (respectively i) to refer to everything related to a user (respectively to an item), and the rating given by user u to item i is denoted r_ui. U and I refer to anything relative to all users and all items, such as the rating matrix R_UI. Similarly, lower case letters are used for scalars or vectors and upper case letters for matrices. d_ui is the actual review text written by user u for item i. It is composed of κ_ui sentences: d_ui = {s_uik, 1 ≤ k ≤ κ_ui}. In this work, we consider documents as bags of sentences. To simplify notations, s_uik is replaced by s_ui when there is no ambiguity. Thus, user appreciations are quadruplets (u, i, r_ui, d_ui). Recommender systems use past information to compute a rating prediction r̂_ui; the corresponding prediction function is denoted f(u, i).

For the experiments, ratings and text reviews are split into training, validation and test sets, respectively denoted S_train, S_val and S_test, containing m_train, m_val and m_test user appreciations (text and rating). We denote S_train^(u) the subset of all reviews in S_train that were written by user u and m_train^(u) the number of such reviews. Similarly, S_train^(i) and m_train^(i) are used for the reviews on item i.

2.2 Hybrid recommender system with text profiles

Recommender systems classically use rating history to predict the rating r̂_ui that user u will give to item i. The hybrid system described here makes use of both collaborative filtering, through matrix factorization, and textual information to produce a rating, as described in (1):

    $f(u, i) = \mu + \mu_u + \mu_i + \gamma_u \cdot \gamma_i + g(u, i)$    (1)

The first three predictors in equation (1) are biases (overall bias, user bias and item bias). The fourth predictor is a classical matrix factorization term. The novelty of our model comes from the fifth term of (1), which takes text profiles into account to refine the prediction f. Our aim for the rating prediction is to minimize the following empirical loss function:

    $\operatorname*{argmin}_{\mu, \mu_u, \mu_i, \gamma_u, \gamma_i, g} L = \frac{1}{m_{train}} \sum_{S_{train}} (r_{ui} - f(u, i))^2$    (2)

To simplify the learning procedure, we first optimize the parameters of the different components independently, as described in the following subsections. Then we fine-tune the combination of these components by learning weighting coefficients so as to optimize the performance criterion (2) on the validation set.

2.2.1 Matrix factorization

We first compute the different biases from eq. (1) as the averaged ratings over their respective domains (overall, user and item). For the matrix factorization term, we approximate the rating matrix R_UI using two latent factors: R_UI ≈ Γ_U Γ_I^T. Γ_U and Γ_I are two matrices representing collections of latent profiles, with one profile per row. We denote γ_u (resp. γ_i) the row of Γ_U (resp. Γ_I) corresponding to the latent profile of user u (resp. item i).

The profiles are learned by minimizing, on the training set, the mean squared error between the known ratings in matrix R_UI and the approximation provided by the factorization Γ_U Γ_I^T. This minimization problem, described in equation (3) with an additional L2 constraint (4) on the factors, is solved here using non-negative matrix factorization:

    $\Gamma_U^*, \Gamma_I^* = \operatorname*{argmin}_{\Gamma_U, \Gamma_I} \|M_{train} \odot (R_{UI} - \Gamma_U \Gamma_I^T)\|_F^2 + \lambda_U \|\Gamma_U\|_F^2 + \lambda_I \|\Gamma_I\|_F^2$    (3)-(4)
In this equation, M_train is a binary mask that has the same dimensions as matrix R_UI (an entry is 1 only if the corresponding review is in the training set), $\odot$ is the element-wise product and $\|\cdot\|_F$ denotes the Frobenius norm.
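As an illustration, the masked factorization of eqs. (3)-(4) can be solved with standard multiplicative updates. The following is a minimal numpy sketch under our own assumptions (the paper does not specify the solver, and the function and parameter names are hypothetical):

```python
import numpy as np

def masked_nmf(R, M, k=20, lam_u=0.1, lam_i=0.1, n_iter=200, eps=1e-9):
    """Minimal sketch of eqs. (3)-(4): non-negative factorization of the
    rating matrix R (users x items), restricted to the training entries
    flagged by the binary mask M, with L2 penalties on both factors."""
    rng = np.random.default_rng(0)
    Gu = rng.random((R.shape[0], k)) + eps  # Gamma_U: one latent profile per user
    Gi = rng.random((R.shape[1], k)) + eps  # Gamma_I: one latent profile per item
    for _ in range(n_iter):
        # multiplicative updates keep both factors non-negative
        Gu *= ((M * R) @ Gi) / ((M * (Gu @ Gi.T)) @ Gi + lam_u * Gu + eps)
        Gi *= ((M * R).T @ Gu) / ((M * (Gu @ Gi.T)).T @ Gu + lam_i * Gi + eps)
    return Gu, Gi
```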
2.2.2 Text profiles exploitation

Let us denote π_u the text profile of user u and σ_t(π_u', π_u) a similarity operator between user profiles. The last component of the predictor f in (1) is a weighted average of the user ratings for item i, where the weight σ_t(π_u', π_u) is the similarity between the text profiles π_u' and π_u of users u' and u, the latter being the target user. This term takes into account the fact that two users with similar styles, or using similar expressions in their appreciation of an item, should share close ratings on this item. The prediction term for the user/item couple (u, i) is then expressed as a weighted mean:

    $g(u, i) = \frac{1}{m_{train}^{(i)}} \sum_{S_{train}^{(i)}} r_{u'i} \, \sigma_t(\pi_{u'}, \pi_u)$    (5)

Two different representations for the text profiles π_u of the users are investigated in this article: one is based on a latent representation of the texts obtained by a neural network autoencoder, the other relies on a robust bag of words coding. Each one is associated to a dedicated metric σ_t. This leads to two formulations of g, and thus to two rating prediction models. We denote the former f_A (autoencoder) and the latter f_T (bag of words). Details are provided below.

Bag of words.

A preprocessing step removes all words appearing in less than 10 documents. Then, the 100 000 most frequent words are kept. Although the number of features is large, the representation is sparse and scales well. π_u is simply the binary bag of words of all the texts of user u. In this high dimensional space, the proximity in style between two users is well described by a cosine function; a high value indicates a similar usage of words:

    $\sigma_t(\pi_{u'}, \pi_u) = \pi_{u'} \cdot \pi_u / (\|\pi_{u'}\| \, \|\pi_u\|)$    (6)
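To make eqs. (5) and (6) concrete, here is a minimal sketch of the text-based prediction term over sparse binary profiles; the containers `profiles` and `train_reviews` are hypothetical data structures of our own, not part of the paper:

```python
import numpy as np

def sigma_t_cosine(pi_a, pi_b):
    """Cosine similarity of eq. (6) between two binary bag-of-words
    profiles stored as scipy sparse row vectors."""
    num = pi_a.multiply(pi_b).sum()
    den = np.sqrt(pi_a.multiply(pi_a).sum() * pi_b.multiply(pi_b).sum())
    return num / den if den > 0 else 0.0

def g(u, i, train_reviews, profiles):
    """Weighted mean of eq. (5): ratings given to item i by other users,
    weighted by the similarity of their text profile with that of u.
    train_reviews[i] is assumed to hold (user, rating) pairs for item i."""
    terms = [r_vi * sigma_t_cosine(profiles[v], profiles[u])
             for v, r_vi in train_reviews[i]]
    return sum(terms) / len(terms) if terms else 0.0
```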
Autoencoder.

The neural network autoencoder has two components: a coding operator and a decoding operator, denoted respectively cod and dec. The two vectorial operators are learned so as to enable the reconstruction of the original text after a projection in the latent space. Namely, given a sentence s_uik represented as a binary bag of words vector, we obtain a latent profile π_suik = cod(s_uik) and then reconstruct an approximation of the sentence using ŝ_uik = dec(π_suik).

The autoencoder is optimized so as to minimize the reconstruction error over the training set:

    $cod^*, dec^* = \operatorname*{argmin}_{cod, dec} \sum_{S_{train}} \frac{1}{\kappa_{ui}} \sum_{k=1}^{\kappa_{ui}} \|s_{uik} - dec(cod(s_{uik}))\|^2$    (7)

We use the settings proposed in [6]: our dictionary is obtained after stopwords removal, by selecting the 5000 most frequent words. We did not use a larger dictionary such as the one used for the bag of words representation, since it does not lead to improved performance and simply increases the computational load. All sentences are represented as binary bags of words using this dictionary. The coding dimension has been set to 1000 after a few evaluation trials. Note that the precise value of this latent dimension is not important: the performance is similar over a large range of dimension values. Both cod and dec use sigmoid units sig(t) = 1/(1 + exp(-t)):

    $cod(s_{uik}) = \pi_{uik} = sig(W s_{uik} + b)$
    $dec(\pi_{uik}) = sig(W^T \pi_{uik} + b')$    (8)

Here, π_uik is a vector, W is a 5000x1000 weight matrix and sig() is a pointwise sigmoid operator operating on the vector W s_uik + b.

As motivated in [11, 5], such a latent representation helps exploiting term co-occurrences and thus introduces some semantics. It provides a robust text representation. The hidden activity of this neural network produces a continuous representation for each sentence, accounting for the presence or absence of groups of words.

π_u is obtained by coding the vector corresponding to all the text written by user u in the past. It lies in a latent word space where a low Euclidean distance between users means a similar usage of words. Thus, for the similarity σ_t, we use an inverse Euclidean distance in the latent space:

    $\sigma_t(\pi_{u'}, \pi_u) = 1 / (\alpha + \|\pi_{u'} - \pi_u\|)$    (9)
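Below is a minimal PyTorch sketch of the tied-weight sigmoid autoencoder of eqs. (7)-(8). Initialization, optimizer and batching are our own assumptions (the paper specifies neither), and the per-batch mean only approximates the per-document averaging of eq. (7):

```python
import torch
import torch.nn as nn

class TiedAutoencoder(nn.Module):
    """Tied-weight sigmoid autoencoder of eq. (8): cod(s) = sig(W s + b),
    dec(pi) = sig(W^T pi + b'), from 5000 words down to 1000 latent units."""
    def __init__(self, n_words=5000, n_latent=1000):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(n_latent, n_words))
        self.b = nn.Parameter(torch.zeros(n_latent))
        self.b_prime = nn.Parameter(torch.zeros(n_words))

    def cod(self, s):                    # s: (batch, n_words) binary vectors
        return torch.sigmoid(s @ self.W.T + self.b)

    def dec(self, pi):                   # pi: (batch, n_latent)
        return torch.sigmoid(pi @ self.W + self.b_prime)

    def forward(self, s):
        return self.dec(self.cod(s))

model = TiedAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch):
    """One step on the reconstruction objective of eq. (7)."""
    opt.zero_grad()
    loss = ((batch - model(batch)) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()
    return loss.item()
```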
2.2.3 Global training criterion for ratings prediction

In order to connect all the elementary components described above with respect to our recommendation task, we introduce (positive) weighting parameters β in (1). Thus, the initial optimization problem (2) becomes:

    $\beta^* = \operatorname*{argmin}_{\beta} \frac{1}{m_{train}} \sum_{S_{train}} \big(r_{ui} - (\beta_1 \mu^* + \beta_2 \mu_u^* + \beta_3 \mu_i^* + \beta_4 \gamma_u^* \cdot \gamma_i^* + \beta_5 \, g(u, i))\big)^2$    (10)

The linear combination is optimized using a validation set: this step guarantees that all components are combined in an optimal manner.
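Since eq. (10) is linear in β with a positivity constraint, one simple way to fit the combination is non-negative least squares; the sketch below assumes the component predictions are precomputed, and the names are hypothetical:

```python
import numpy as np
from scipy.optimize import nnls

def fit_betas(val_set, mu, mu_u, mu_i, Gu, Gi, g):
    """Fit the positive weights of eq. (10) by non-negative least squares
    on the validation appreciations (u, i, r_ui)."""
    X = np.array([[mu, mu_u[u], mu_i[i], Gu[u] @ Gi[i], g(u, i)]
                  for u, i, _ in val_set])
    y = np.array([r_ui for _, _, r_ui in val_set])
    betas, _ = nnls(X, y)  # one weight per component of f(u, i)
    return betas
```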
2.3 Text generation model

The goal here is to generate a review text for each (u, i) recommendation. During the recommendation process, this text is additional information for users to consider. It should catch their interest and, in principle, be close to the one that user u could have written on item i. Each text is generated as an extractive summary, where the extracted sentences s_u'i come from the reviews written by other users (u' ≠ u) about item i. Sentence selection is performed according to a criterion which combines a similarity between the sentence and the textual user profile, and a similarity between the actual rating r_u'i and the prediction r̂_ui made for (u, i), computed as described in section 2.2. The former measure could take into account several dimensions like vocabulary, sentiment expression and even style; here it is mainly the vocabulary which is exploited. The latter measures the proximity between user tastes. For the text measure, we make use of the σ_t similarity introduced in section 2.2. As before, we will consider the two representations for texts (latent coding and raw bag of words). For the ratings similarity, we use σ_r(r_u'i, r_ui) = 1/(1 + |r_u'i - r_ui|).

Suppose one wants to select a single sentence for the extracted summary. The sentence selection criterion will then be a simple average of the two similarities:

    $h(s_{u'i}, r_{u'i}, u', u, i) = \frac{1}{2}\big(\sigma_t(s_{u'i}, \pi_u) + \sigma_r(r_{u'i}, \hat{r}_{ui})\big)$    (11)

Note that this function may score any piece of text. In the following, we then consider three possibilities for generating text reviews. The first one simply consists in selecting the best sentence s_u'i among all the training sentences for item i with respect to h. We call it 1S, for single sentence. The second one selects a whole review d_u'i among all the reviews for i; the document is here considered as one long sentence. This is denoted CT, for complete text. The third one is a greedy procedure that selects multiple sentences; it is denoted XS. It is initialized with 1S, and then sentences are selected under two criteria: relevance with respect to h, and diversity with respect to the sentences already selected. Selection is stopped when the length of the text is greater than the average length of the texts of the target user. Algorithm 1 sums up the XS procedure for generating the text d̂_ui for the couple (user u, item i).
    Data: u, i, S = {(s_u'i, r_u'i), u' ≠ u}
    Result: d̂_ui
    s*_u'i ← argmax_{s_u'i ∈ S} h(s_u'i, r_u'i, u', u, i);
    d̂_ui ← s*_u'i;
    remove s*_u'i from S;
    while length(d̂_ui) < averagelength(u) do
        s*_u'i ← argmax_{s_u'i ∈ S} [ h(s_u'i, r_u'i, u', u, i) - cos(s_u'i, d̂_ui) ];
        append s*_u'i to d̂_ui;
        remove s*_u'i from S;
    end

Algorithm 1: XS greedy procedure: selection of successive sentences to maximize both relevance and diversity. d̂_ui is the text that is generated, sentence after sentence.
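A direct Python transcription of Algorithm 1 could look as follows; `h`, `cos` and `average_length` are passed in as callables, and length is counted in characters, an assumption the paper leaves open:

```python
def xs_summary(u, i, candidates, h, cos, average_length):
    """Greedy XS procedure of Algorithm 1. candidates holds, for item i,
    triples (sentence, rating, author) taken from other users' reviews."""
    pool = list(candidates)
    # initialization: the best single sentence (the 1S criterion)
    best = max(pool, key=lambda c: h(c[0], c[1], c[2], u, i))
    pool.remove(best)
    summary = [best[0]]
    # grow the text while it is shorter than the user's average review
    while pool and sum(len(s) for s in summary) < average_length(u):
        best = max(pool, key=lambda c: h(c[0], c[1], c[2], u, i)
                                       - cos(c[0], " ".join(summary)))
        pool.remove(best)
        summary.append(best[0])
    return " ".join(summary)
```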
                                                                                    independently since there is no global evaluation framework.
                                                                                    These individual performances should however provide to-
                                                                                    gether a quantitative appreciation of the whole system.
2.4       Sentiment prediction model                                                   We use two real world datasets of user reviews, collected
   We show here how polarity information about an item can                          from amazon.com [8] and ratebeer.com [11]. Their charac-
be estimated by exploiting both the user predicted ratings                          teristics are presented in table 1.
and his textual profile. Exploiting both information sources                           Below, one presents first how datasets are preprocessed
improves the sentiment prediction performance compared                              in 3.1. The benefits of incorporating the text in the ratings
with a usual text based sentiment classifier.                                       prediction for the recommender system are then discussed in
   Polarity classification is the task of predicting whether a                      section 3.2. The quality of the generated reviews is evaluated
text dui (here of a review) is positive or negative. We use as                      and analyzed in section 3.3 Finally, the performance of the
ground truth the ratings rui and follow a standard thresh-                          sentiment classifier combining text and ratings is described
olding procedure [15]: reviews rated 1 or 2 are considered                          in 3.4.
as negative, while items rated 4 or 5 are positive. All texts
that are rated 3 are ignored as it is unclear whether that                          3.1          Data preprocessing
are positive or negative: it strongly depends on the rating                           Reviews from different websites have different formats (rat-
habits of the user.                                                                 ing scales, multiple ratings, . . . ). We focus on the global
   For evaluation purpose, we consider two baselines. A first                       rating and scaled it to a 1 to 5 integer range. For titled
one only uses the rating prediction of our recommender sys-                         reviews, the title is considered as the first sentence of the
tem f (u, i) as a label prediction, this value is then thresh-                      text of the review. Each dataset is randomly split into three
olded as indicated above. A second one is a classical text                          parts: training, validation and test containing respectively
sentiment classifier. Denoting by dui the binary bag of word                        80%, 10% and 10% of the reviews.
representation of a document and cui the binary label associ-                         As described in 2.2, two representations of the text are
ated to the rating rui , one uses a linear SVM s(dui ) = dui .w.                    considered each with a different dictionary:
Note that this is usually a strong baseline for the polarity
classification task. Our final classifier will combine f (u, i)                          • for the autoencoder, we have selected the 5000 most
and s(dui ) in order to solve the following optimization prob-                             frequent words, with a stopwords removal step; The
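Eq. (12) is a standard hinge loss in which the rating prediction acts as a fixed offset. A minimal subgradient-descent sketch follows; the hyperparameters and the centering of f(u, i) are our own assumptions:

```python
import numpy as np

def train_combined_classifier(X, f_ui, c, lam=1e-4, lr=0.01, epochs=10):
    """Subgradient descent on the hinge loss of eq. (12). X holds the binary
    bag-of-words rows d_ui, f_ui the rating predictions f(u, i) used as a
    fixed offset (possibly centered, e.g. f - 3, so that its sign carries
    polarity; an assumption), c the +1/-1 labels of reviews rated != 3."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in np.random.permutation(X.shape[0]):
            grad = 2 * lam * w
            if (X[j] @ w + f_ui[j]) * c[j] < 1:  # hinge term is active
                grad -= c[j] * X[j]
            w -= lr * grad
    return w

# prediction for a new review: sign(d @ w + f(u, i))
```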
3. EXPERIMENTS

All three modules, ratings, text and sentiments, are evaluated independently, since there is no global evaluation framework. These individual performances should however provide together a quantitative appreciation of the whole system.

We use two real world datasets of user reviews, collected from amazon.com [8] and ratebeer.com [11]. Their characteristics are presented in table 1.

    Source        Subset          #Users    #Items    #Training    #Validation    #Test
    ratebeer.com  RB U50 I200         52       200        7200           900         906
    ratebeer.com  RB U500 I2k        520      2000      388200         48525       48533
    ratebeer.com  RB U5k I20k       5200     20000     1887608        235951      235960
    amazon.com    A U200 I120        213       122         984           123         130
    amazon.com    A U2k I1k         2135      1225       31528          3941        3946
    amazon.com    A U20k I12k      21353     12253      334256         41782       41791
    amazon.com    A U210k I120k   213536    122538     1580576        197572      197574

Table 1: User, item & review counts for every dataset.

Below, we first present how the datasets are preprocessed (3.1). The benefits of incorporating the text in the rating prediction for the recommender system are then discussed in section 3.2. The quality of the generated reviews is evaluated and analyzed in section 3.3. Finally, the performance of the sentiment classifier combining text and ratings is described in 3.4.

3.1 Data preprocessing

Reviews from different websites have different formats (rating scales, multiple ratings, ...). We focus on the global rating and scale it to a 1 to 5 integer range. For titled reviews, the title is considered as the first sentence of the text of the review. Each dataset is randomly split into three parts: training, validation and test, containing respectively 80%, 10% and 10% of the reviews.

As described in 2.2, two representations of the text are considered, each with a different dictionary (a hedged construction sketch is given after the list):

• for the autoencoder, we have selected the 5000 most frequent words, with a stopwords removal step; the autoencoder input vector is then a binary vector of dimension 5000.

• for the raw representation, we have selected the 100 000 most frequent words appearing in more than 10 documents (including stopwords) and used a binary vector representation.
For the experiments, we consider several subsets of the databases with different numbers of users and items. Each subset is built by extracting, for a given number of users and items, the most active users and the most commented items. Dataset characteristics are given in table 1.

3.2 Recommender system evaluation

Let us first consider the evaluation of the rating prediction. The metric used here is the mean squared error (MSE) between rating predictions r̂_ui and actual ratings r_ui. The lower the MSE, the better the model is able to estimate the correspondence between user tastes and items. Results are presented in table 2.

    Subsets          µ        µu       µi       γu.γi    fA       fT
    RB U50 I200      0.7476   0.7291   0.3096   0.2832   0.2772   0.2773
    RB U500 I2k      0.6536   0.6074   0.3359   0.3168   0.3051   0.3051
    RB U5k I20k      0.7559   0.6640   0.3912   0.3555   0.3451   0.3451
    A U200 I120      1.5348   2.0523   1.6563   1.7081   1.4665   1.4745
    A U2k I1k        1.5316   1.4391   1.3116   1.0927   1.0483   1.0485
    A U20k I12k      1.4711   1.4241   1.2849   1.0797   1.0426   1.0426
    A U210k I120k    1.5072   2.1154   1.5318   1.2915   1.1671   1.1678

Table 2: Test performance (mean squared error) for recommendation. µ, µu and µi are the overall bias, user bias and item bias baselines. γu.γi is the plain matrix factorization baseline. fA and fT are our hybrid recommender systems, relying respectively on latent and raw text representations. The different datasets are described in table 1.

The models are referenced using the notations introduced in section 2.2. The first column corresponds to a trivial system which predicts the overall bias µ; the second predicts the user bias µu. Both give poor performance, as expected.

The third column corresponds to the item bias baseline µi. It assumes that user taste is not relevant and that each item has its own intrinsic quality. The improvement with respect to µ and µu is important, since the MSE is halved. The fourth column corresponds to a non-negative matrix factorization baseline, denoted γu.γi. It jointly computes latent representations for user tastes and item characteristics. Unsurprisingly, it is our best baseline.

It can be noted that performance tends to degrade when the subset size increases. This is a side effect of the review selection process used for building the different datasets: smaller datasets contain the most active users and the most commented items, and the estimation of their profiles benefits from the high number of reviews per user (and item) in this context.

The last two columns refer to our hybrid recommender systems, using the two text representations introduced in section 2.2. Both fA (autoencoder) and fT (raw text) perform better than the baseline collaborative filtering system, and both have similar approximation errors. The main difference between the systems comes from the complexity of the approach: during the learning step, fT is much faster than fA, given the fact that no autoencoder optimization is required. On top of that, fT remains faster in the inference step: the inherent sparsity of the bag of words representation enables fT to provide faster computations than fA. The autoencoder works in a smaller dimensional space, but it is not sparse.

3.3 Text generation evaluation

We move on now to the evaluation of the personalized review text generation module. Since we are using an extractive summary procedure, we make use of a classical measure for summarization systems: the recall-oriented ROUGE-n metric, comparing the generated text against the actual text of the review produced by the user. As far as we know, generating candidate reviews has never been dealt with in this context, and this is a novel task. The ROUGE-n metric is the proportion of n-grams of the actual text found in the predicted (candidate) text; we use n = {1, 2, 3}. The higher ROUGE-n, the better the quality of the candidate text. A good ROUGE-1 means that topics or vocabulary are correctly caught, while ROUGE-2 and ROUGE-3 are more representative of the user's style.
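For clarity, here is a minimal sketch of the recall-oriented ROUGE-n used in this evaluation, with whitespace tokenization as a simplifying assumption:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Recall-oriented ROUGE-n: fraction of the n-grams of the reference
    review that also appear in the candidate text (clipped counts)."""
    def ngrams(text):
        toks = text.split()  # whitespace tokenization (an assumption)
        return Counter(tuple(toks[k:k + n]) for k in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    clipped = sum(min(count, cand[g]) for g, count in ref.items())
    return clipped / sum(ref.values())
```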
A first baseline is given by using a random scoring function h (instead of the formulation given in (11)). It provides a lower bound on the performance. Three oracles are then used to provide an upper bound on the performance: they directly optimize the ROUGE-n metrics from the data on the test set. A matrix factorization baseline is also used. It is a special case of our model where no text information is used. This model computes a similar score for all the sentences of a given user relative to an item. When one sentence only is selected, it is taken at random among the sentences of this user for the item. With greedy selection, the first sentence is chosen at random and then the cosine diversity term (algorithm 1) allows a ranking of the next candidate sentences. Our proposed method is evaluated with the two different user profile representations π_u (autoencoder and raw text). The performances of these seven models on the two biggest datasets, with respect to the three metrics, are aggregated in figure 2.

[Figure 2: Histograms of the performance of the summarizer on the two biggest datasets (RB U5k I20k and A U210k I120k). ROUGE-1 scores are shown in blue, ROUGE-2 and ROUGE-3 scores in yellow and black. 7 models are compared: random, 3 oracles, the NMF based model, and the fA and fT based models. 3 frameworks are investigated: CT (review extraction), 1S (one sentence extraction), XS (multiple sentence extraction). (a) RateBeer experiments; (b) Amazon experiments.]

A histogram corresponds to a text selection entity (whole review text, best single sentence, greedy sentence selection). Groups in the histograms (respectively row blocks of the tables) are composed of three cells corresponding respectively to the ROUGE-1, -2 and -3 metrics. Not surprisingly, the results for the single sentence selection procedure (1S) are always worse than for the other two (CT: complete review, and XS: multiple sentences). This is simply because a sentence contains fewer words than a full review, and it can hardly share more n-grams with the reference text than the full text does. For the ratebeer.com datasets, selecting a set of sentences clearly offers a better performance than selecting a whole review in all cases. Texts written to describe beers also describe the tasting experience. Was it in a bar or at home? Was it a bottle or on tap? Texts of the community share the same structure and vocabulary to describe both the tasting and the flavors of the beer. Most users write short and precise sentences. This is an appropriate context for our sentence scoring model, where the habits of users are caught by our recommender systems. The performance slightly decreases when the size of the dataset is increased. As before, this is in accordance with the selection procedure of these datasets, which focuses first on the most active users and most commented items. For Amazon, the conclusion is not so clear: depending on the conditions, either whole reviews or selected sentences get the best score.


This is linked to the higher variety in the community of users on the website: well structured sentences like those present in RateBeer are mixed here with different levels of English and troll reviews.

Overall, the different models follow a clear hierarchy. First, stating the obvious, the random model has the worst performance. Then, using a recommender system to select relevant sentences helps in terms of ROUGE-n performance. Using the text information brings, most of the time, only a small score improvement. Overall, our models only offer small improvements here with respect to random or NMF text selection (i.e. selection based on rating similarity only). After analyzing this behavior, we believe that this is due to the shortness of the text reviews, to their relatively standardized form (arguments are very similar from one review to another), to the peaked vocabulary distribution of the reviews, and to the nature of ROUGE. The latter is a classical recall-oriented summarization evaluation measure, but it does not distinguish well between text candidates in this context. This also shows that there is room for improvement on this aspect.

Concerning the oracles, several conclusions can be drawn. For both single sentence and complete text selection, the gap between the ROUGE measures and the proposed selection method is important, suggesting that there is still room for improvements here too. For the greedy sentence selection, the gap between the oracles and the hybrid recommender systems is moderate, suggesting that the procedure is fully efficient here. However, this conclusion should be moderated: it can be observed that whereas ROUGE is effectively an upper bound for single sentence or whole review selection, this is no longer the case for multiple sentence selection. Because of the complexity of selecting the best subset of sentences according to a loss criterion (which amounts to a combinatorial selection problem), we have been using a suboptimal forward selection procedure: we first select the best ROUGE sentence, then the second best, etc. In this case the ROUGE oracle is no longer optimal.

Concerning the measures, the performance decreases rapidly when we move from ROUGE-1 to ROUGE-2 and ROUGE-3. Given the problem formulation and the context of short product reviews, ROUGE-2 and ROUGE-3 are clearly too constraining and the corresponding scores are not significant.

3.4 Sentiment classification evaluation

The performance of the different models, using the sentiment classification error as an evaluation metric, is presented in table 3. Because they give very poor performance, the bias recommendation models (µ and µu) are not presented here.

    Subsets          LL       µi       γu.γi    fA       fT       LL+fA    LL+fT
    RB U50 I200       5.35     5.12     6.01     5.57     5.57     3.79     3.79
    RB U500 I2k       7.18    10.67     9.73     8.55     8.55     6.52     6.92
    RB U5k I20k       8.44    11.80    10.04     9.17     9.17     8.33     8.35
    A U200 I120      10.00    15.83    22.50    20.00    20.83    10.00    10.00
    A U2k I1k         7.89    15.25    12.85    12.62    12.62     7.54     7.54
    A U20k I12k       6.34    13.99    12.79    12.38    12.37     6.29     6.29
    A U210k I120k     6.25    14.04    14.40    13.32    13.31     6.22     6.22

Table 3: Test performance (classification error, in %) of the polarity classifiers. LL stands for LibLinear (SVM); µi, γu.γi, fA and fT are the recommender systems of table 2. LL + fA and LL + fT are two hybrid opinion classification models combining the SVM classifier with the fA and fT recommender systems.

The item bias µi, second column, gives a baseline which is improved by the matrix factorization γu.γi, third column. Our hybrid models fA, fourth column, and fT, fifth column, have lower classification errors than all the other recommender systems. The first column, LL, is the linear support vector machine (SVM) baseline. It has been learnt on the training set texts, and the regularization hyperparameter has been selected using the validation set. Our implementation relies on liblinear (LL) [4].

Its performance is better than that of the recommender systems, but it should be noted that it makes use of the actual text d_ui of the review, whereas the recommender systems only use past information regarding user u and item i. Note that even in this context, the recommender performance on RateBeer is very close to the SVM baseline.

It is then possible to combine the two models, according to the formulation proposed in section 2.4. The resulting hybrid approaches, denoted LL + fA and LL + fT, exploit both the text based decision (SVM) and the user profile (fA and fT). This combined model shows good classification performance and overcomes the LL baseline in 4 out of 7 experiments in table 3, while performing similarly to LL in the other 3 experiments.

4. OVERALL GAINS

In order to get a global vision of the overall gain provided by the proposed approach, we summarize here the results obtained on the different tasks. For each task, the gain with respect to the (task dependent) baseline is computed and averaged (per task) over all datasets. The metric depends on the task. Results are presented in figure 3.
   As for the polarity classifiers, it is possible to combine the two models according to the formulation proposed in section 2.4. The resulting hybrid approaches, denoted LL + fA and LL + fT, exploit both the text-based decision (SVM) and the user profile (fA and fT). This combined model shows good classification performance: it overcomes the LL baseline in 4 out of 7 experiments in table 3, while performing similarly to LL in the other 3.

4.   OVERALL GAINS
   In order to get a global view of the overall gain provided by the proposed approach, we summarize here the results obtained on the different tasks. For each task, the gain with respect to the (task-dependent) baseline is computed and averaged over all datasets; the metric depends on the task. Results are presented in figure 3.

Figure 3: Aggregated gains on the 3 tasks w.r.t. classic baselines: our hybrid recommender systems are better overall. (a) Recommender systems (µ, µu, µi, fA, fT): gain in % w.r.t. the MSE of the matrix factorization γu.γi. (b) Summarizers (γu.γi, fA, fT, rouge-1/2/3 oracles): gain in % w.r.t. the random selection procedure on ROUGE-n. (c) Opinion classifiers (µi, γu.γi, fA, fT, LL + fA, LL + fT): gain in % w.r.t. the good-classification rate of the SVM baseline LL.
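   To make the aggregation precise, each gain is a relative improvement over the task baseline; the following minimal sketch (our illustration, with hypothetical metric values) shows the computation for an error metric such as the MSE:

    def relative_gain(model: float, baseline: float,
                      lower_is_better: bool = False) -> float:
        # Gain (in %) of a model w.r.t. its task baseline. For error
        # metrics such as the MSE, an improvement is a *lower* value.
        if lower_is_better:
            return 100.0 * (baseline - model) / baseline
        return 100.0 * (model - baseline) / baseline

    # Averaged per task over all datasets (hypothetical MSE values):
    gains = [relative_gain(m, b, lower_is_better=True)
             for m, b in [(0.95, 1.01), (1.21, 1.28)]]
    average_gain = sum(gains) / len(gains)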


   For the mean squared error metric (figure 3a), matrix factorization is used as the baseline. The user bias µu fails badly to generalize on two datasets. The item bias is closer to the baseline (−11.43%). Our hybrid models, which use texts to refine user and item profiles, bring a gain of 5.71% for fA and 5.63% for fT. This demonstrates the interest of including textual information in the recommender system. The autoencoder and raw-text approaches offer similar gains, the latter being overall faster.
   For text generation, we take the random model as the baseline; results are presented in figure 3b. The gain is computed for the three investigated frameworks (CT: review selection, 1S: one-sentence selection, XS: multiple-sentence selection) and per measure (ROUGE-1, 2, 3), then averaged into one overall gain. The ROUGE-n oracles clearly outperform the other models, as expected. The different recommender systems behave very similarly, with respective gains of 11.15% (matrix factorization), 11.89% (autoencoder) and 11.83% (raw text). Here textual information helps but does not clearly dominate ratings, providing only a small improvement. Remember that, although improving over the baselines is desirable, the main novelty of the approach is to propose personalized summary generation together with the usual rating prediction.
   For the opinion classifier, presented in figure 3c, the baseline is a linear SVM (LL). Basic recommender systems perform poorly with respect to this baseline. Surprisingly, the item bias µi (−68.71%) performs slightly better than matrix factorization γu.γi (−69.54%) in this sentiment classification context (no neutral reviews and binary ratings). Using textual information increases the performance: the autoencoder-based model fA (−57.17%) and the raw-text approach fT (−58.31%) perform similarly. As discussed in section 3.4, the linear SVM uses the text of the current review whereas the recommender systems do not. As a consequence, it is worth combining both predictions in order to exploit both the text and the past profiles: the resulting models give respective gains of 4.72% (autoencoder) and 3.89% (raw text) w.r.t. the SVM.

5.   RELATED WORK
   Since the paper covers the topics of rating prediction, summarization and sentiment classification, we briefly review related work for each of them.

5.1   Recommender systems
   Three main families of recommendation algorithms have been developed [3]: content-based, knowledge-based and collaborative filtering. Given the focus of this work on consumer reviews, we considered collaborative filtering. For merchant websites, the goal is to encourage users to buy new products, and the problem is usually cast either as the prediction of a ranked list of relevant items for each user [13] or as the completion of missing ratings [9]. We focused here on the latter approach for evaluation reasons, since we use data collected from third-party sources.

5.2   Text summarization for consumer reviews
   Early reference work [7] on consumer reviews focused on the global summarization of the user reviews of each item. The motivation was to extract the sentiments associated with a list of features from all the item review texts; the summary took the form of a rating or of an appreciation of each feature. Here, in contrast to this line of work, the focus is on personalized item summaries for a target user. Given the difficulty of producing a comprehensive synthetic summary, we have turned this problem into a sentence or text selection process.
   Evaluating summaries is challenging: how can the quality of a summary be assessed when the ground truth is subjective? In our context, the review texts are available and we use them as the ground truth, with the classical ROUGE-n summary evaluation measures [10].
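In its recall form, ROUGE-n counts the proportion of reference n-grams recovered by the candidate summary. A minimal sketch (simple whitespace tokenization; the full package [10] additionally supports stemming and stopword removal):

    from collections import Counter
    from typing import List

    def ngrams(tokens: List[str], n: int) -> Counter:
        # Multiset of word n-grams of a tokenized text.
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
        # ROUGE-n recall: clipped n-gram overlap divided by the
        # number of n-grams in the reference review.
        ref = ngrams(reference.lower().split(), n)
        cand = ngrams(candidate.lower().split(), n)
        overlap = sum(min(c, cand[g]) for g, c in ref.items())
        total = sum(ref.values())
        return overlap / total if total else 0.0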
5.3   Sentiment classification
   Different latent text representations have been proposed in this scope: [14] proposed a generative model representing topics and sentiments jointly, and more recently several works have considered matrix factorization or neural networks in an attempt to develop robust sentiment recognition systems [6]. [16] go further and propose to learn two types of representation: a vectorial model for words together with a latent transformation model, which allows the representation of negation and quantifiers associated with an expression.
   We have investigated two kinds of representation for the texts: bags of words and a latent representation obtained with autoencoders as in [6]. [11] also uses a latent representation of reviews, although in a probabilistic setting instead of the deterministic one adopted here.

5.4   Hybrid approaches
   In the field of recommendation, a first hybrid model was proposed by [5]: it is based on the hand labeling of review sentences (topic and polarity) to identify the relevant characteristics of the items. [11] pushes the exploitation of texts further, using a joint latent representation for ratings and textual content with the objective of improving rating accuracy. These two works focus on rating prediction and do not consider delivering additional information to the user. Very recently, [19] considered adding an explanation component to a recommender system: they propose to extract keywords from the review texts that are supposed to explain why a user likes or dislikes an item. This is probably the work whose spirit is closest to ours, but they do not provide a quantitative evaluation.
   [7] combined opinion mining and text summarization on product reviews with the goal of extracting their qualities and defects. [17] proposed a system delivering personalized answers to user queries on specific products; they built the user profiles with topic modeling, without any sentiment dimension. [1] proposed a personalized news recommendation algorithm evaluated on the Yahoo portal using user feedback, but did not investigate rating or summarization issues. Overall, we propose in this article to go beyond a generic summary of item characteristics by generating, for each user, a personalized summary close to what they would have written about the item themselves.
   For a long time, sentiment classification ignored the user dimension and focused, for example, on the design of "universal" sentiment classifiers able to deal with a large variety of topics [15]. Taking the user into account has become an issue only very recently: [18] exploited explicit relations in social graphs to improve opinion classifiers, but their work focuses on this aspect only; [12] proposed to distinguish different rating behaviors and showed that modeling review authors on a scale ranging from amateur to connoisseur offers a significant gain on an opinion prediction task.
   In our work, we have examined the benefits of taking the text of user reviews into account in recommender systems, for their performance as sentiment classifiers. As a secondary contribution, we have additionally proposed an original model mixing recommender systems and linear classification.

6.   CONCLUSION
   This article proposes an extended framework for the recommendation task. The general goal is to enrich classical recommender systems along several dimensions. As an example, we show how to generate personalized reviews for each recommendation using extractive summaries; this is our main contribution. We also show how ratings and texts can be used to produce efficient personalized sentiment classifiers for each recommendation. Depending on the application, other additional information could be brought to the user. Besides producing additional information for the user, the different information sources can benefit from one another: we show how to effectively make use of review texts and ratings to build improved rating predictors and review summaries, and, as already mentioned, the sentiment classifiers also benefit from the two information sources. This part of the work demonstrates that multiple information sources can be useful for improving recommendation systems, which is particularly interesting since such sources are now effectively available on many online sites. Several new applications could be developed along the lines suggested here. From a modeling point of view, more sophisticated approaches can be devised: we are currently working on a multitask framework where the representations used in the different components are more closely correlated than in the present model.
   Acknowledgements. The authors would like to thank the AMMICO project (F1302017 Q - FUI AAP 13) for funding our research.

7.   REFERENCES
 [1] D. Agarwal, B.C. Chen, and B. Pang. Personalized recommendation of user comments via factor models. EMNLP'11, 2011.
 [2] M. Amini and N. Usunier. A contextual query expansion approach by term clustering for robust text summarization. DUC'07, 2007.
 [3] R. Burke. Hybrid recommender systems: Survey and experiments. UMUAI'02, 2002.
 [4] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR'08, 2008.
 [5] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars: Improving rating predictions using review text content. WebDB'09, 2009.
 [6] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML'11, 2011.
 [7] M. Hu and B. Liu. Mining and summarizing customer reviews. KDD'04, 2004.
 [8] N. Jindal and B. Liu. Opinion spam and analysis. WSDM'08, pages 219–230, 2008.
 [9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, pages 42–49, 2009.
[10] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. ACL Workshop: Text Summarization Branches Out, 2004.
[11] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys'13, 2013.
[12] J.J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW'13, 2013.
[13] M.R. McLaughlin and J.L. Herlocker. A collaborative filtering algorithm and evaluation metric that accurately model the user experience. SIGIR'04, 2004.
[14] Q. Mei, X. Ling, M. Wondra, H. Su, and C.X. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. WWW'07, 2007.
[15] B. Pang and L. Lee. Opinion mining and sentiment analysis. Information Retrieval, 2008.
[16] R. Socher, B. Huval, C.D. Manning, and A. Ng. Semantic compositionality through recursive matrix-vector spaces. EMNLP'12, 2012.
[17] C. Tan, E. Gabrilovich, and B. Pang. To each his own: personalized content selection based on text comprehensibility. WSDM'12, 2012.
[18] C. Tan, L. Lee, J. Tang, L. Jiang, M. Zhou, and P. Li. User-level sentiment analysis incorporating social networks. KDD'11, 2011.
[19] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. SIGIR'14, 2014.