Extended Recommendation Framework: Generating the Text of a User Review as a Personalized Summary

Mickaël Poussevin, Vincent Guigue, Patrick Gallinari
Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS
4 Place Jussieu, Paris, France
mickael.poussevin@lip6.fr, vincent.guigue@lip6.fr, patrick.gallinari@lip6.fr

(CBRecSys 2015, September 20, 2015, Vienna, Austria. Copyright remains with the authors and/or original copyright holders.)

ABSTRACT
We propose to augment rating-based recommender systems by providing the user with additional information that might help him in his choice or in the understanding of the recommendation. We consider, as a new task, the generation of personalized reviews associated to items, and we use an extractive summary formulation for generating these reviews. We also show that the two information sources, ratings and texts, can be used both for estimating ratings and for generating summaries, leading to improved performance for each system compared to the use of a single source. Besides these two contributions, we show how a personalized polarity classifier can integrate the rating and textual aspects. Overall, the proposed system offers the user three personalized hints for a recommendation: rating, text and polarity. We evaluate these three components on two datasets using appropriate measures for each task.

[Figure 1: Our contribution is twofold: (1) improving rating predictions using textual information, (2) generating personalized review summaries to push recommender systems beyond rating prediction. Inputs are user reviews (text and rating); user and item text profiles complement the latent profiles of a classic recommender system.]

1. INTRODUCTION
The emergence of the participative web has enabled users to easily share their sentiments on many different topics. This opinionated data flow grows rapidly and offers opportunities for several applications like e-reputation management or recommendation. Today, many e-commerce websites present each item available on their platform with a description of its characteristics, its average appreciation, and ratings together with individual user reviews explaining those ratings.

Our focus here is on user-item recommendation. This is a multifaceted task where different information sources about users and items can be considered and different recommendation information can be provided to the user. Despite this diversity, the academic literature on recommender systems has focused on only a few specific tasks. The most popular one is certainly the prediction of user preferences given their past rating profile. These systems typically rely on collaborative filtering [9] to predict missing values in a user/item/rating matrix. In this perspective of rating prediction, some authors have made use of additional information sources available on typical e-commerce sites. [5] proposed to extract topics from consumer reviews in order to improve rating predictions. Recently, [11] proposed to learn a latent space common to both textual reviews and product ratings, and showed that rating prediction was improved by such hybrid recommender systems. Concerning the information provided to the user, some models exploit review texts for ranking comments that users may like [1] or for answering specific user queries [17].
We start from the perspective of predicting user preferences and argue that exploiting the information present on many e-commerce sites allows us to go beyond simple rating prediction and to present users with complementary information that may help them make their choice. We consider as an example the generation of a personalized review accompanying each item recommendation. Such a review is a source of complementary evidence for the user's appreciation of a suggestion. As is done for the ratings, we exploit past information and user similarity in order to generate these reviews. Since pure text generation is a very challenging task [2], we adopt an extractive summary perspective: the generated text accompanying each rating is extracted from the reviews of selected users who share similar tastes and appreciations with the target user. Ratings and reviews being correlated, this aspect can also be exploited to improve the predictions. Our rating predictor makes use of user textual profiles extracted from their reviews, and summary extraction in turn uses predicted ratings. Thus both types of information, predicted ratings and generated text reviews, are offered to the user, and each prediction, rating and generated text, takes both sources of information into account. Additional information can also be provided to the user: we show, as an example, that predicted ratings and review texts can be used to train a robust sentiment classifier which provides the user with a personalized polarity indication about the item.

The modules of our system are evaluated on the two main tasks, rating prediction and summary extraction, and on the secondary task of sentiment prediction. For this, experiments are conducted on real datasets collected from amazon.com and ratebeer.com, and the models are compared to classical baselines. The recommender system is compared to a classic collaborative filtering model using the mean squared error metric. We show that using both ratings and user textual profiles allows us to improve the performance of a baseline recommender. The gains come from a more precise understanding of the key aspects and opinions included in the item and user textual profiles. For evaluating the summary text generated for a (user, item) couple, we have at our disposal a gold standard: the very review text written by this user on the item. Note that this is a rare situation in summary evaluation. However, contrarily to collaborative filtering, there is no consensual baseline. We therefore compare our results to a random model and to oracles optimizing the ROUGE-n metric; they respectively provide a lower and an upper bound of the attainable performance. The sentiment classifier is classically evaluated using classification accuracy.

This article is organized as follows. The hybrid formulation, the review generator and the sentiment classifier are presented in section 2. Section 3 gives an extensive experimental evaluation of the framework. The overall gains associated to hybrid models are discussed in section 4. A review of related work is provided in section 5.
2. MODELS
In this section, after introducing the notations used throughout the paper, we describe successively the three modules of our system. We start with the prediction of ratings [11]. Rating predictors answer the following question: what rating will this user give to this item? We present a simple and efficient way to introduce text profiles, representing the writing style and tastes of the user, into a hybrid formulation. We then show how to exploit reviews and ratings in a new, challenging task: what text will this user write about this item? We propose an extractive summary formulation of this task. Finally, we describe how ratings and text can be used together in a personalized sentiment classifier.

2.1 Notations
We use u (respectively i) to refer to everything related to a user (respectively to an item), and the rating given by user u to item i is denoted rui. U and I refer to anything relative to all users and all items, such as the rating matrix RUI. Lower case letters are used for scalars or vectors and upper case letters for matrices. dui is the actual review text written by user u for item i. It is composed of κui sentences: dui = {suik, 1 ≤ k ≤ κui}. In this work, we consider documents as bags of sentences. To simplify notations, suik is replaced by sui when there is no ambiguity. Thus, user appreciations are quadruplets (u, i, rui, dui). Recommender systems use past information to compute a rating prediction r̂ui; the corresponding prediction function is denoted f(u, i).

For the experiments, ratings and text reviews are split into training, validation and test sets, respectively denoted Strain, Sval and Stest, containing mtrain, mval and mtest user appreciations (text and rating). We denote Strain(u) the subset of the reviews of Strain that were written by user u and mtrain(u) the number of such reviews. Similarly, Strain(i) and mtrain(i) are used for the reviews on item i.

2.2 Hybrid recommender system with text profiles
Recommender systems classically use the rating history to predict the rating r̂ui that user u will give to item i. The hybrid system described here makes use of both collaborative filtering, through matrix factorization, and textual information to produce a rating as described in (1):

    f(u, i) = µ + µu + µi + γu·γi + g(u, i)    (1)

The first three predictors in equation (1) are biases (overall bias, user bias and item bias). The fourth predictor is a classical matrix factorization term. The novelty of our model comes from the fifth term of (1), which takes text profiles into account to refine the prediction f. Our aim for the rating prediction is to minimize the following empirical loss function:

    argmin_{µ,µu,µi,γu,γi,g} L = (1/mtrain) Σ_{Strain} (rui − f(u, i))²    (2)

To simplify the learning procedure, we first optimize the parameters of the different components independently, as described in the following subsections. Then we fine-tune the combination of these components by learning weighting coefficients so as to maximize the performance criterion (2) on the validation set.

2.2.1 Matrix factorization
We first compute the different biases from eq. (1) as the averaged ratings over their respective domains (overall, user and item). For the matrix factorization term, we approximate the rating matrix RUI using two latent factors: RUI ≈ ΓU ΓI^T. ΓU and ΓI are two matrices representing collections of latent profiles, with one profile per row. We denote γu (resp. γi) the row of ΓU (resp. ΓI) corresponding to the latent profile of user u (resp. item i). The profiles are learned by minimizing, on the training set, the mean squared error between the known ratings in matrix RUI and the approximation provided by the factorization ΓU ΓI^T. This minimization problem, described in equation (3) with an additional L2 constraint (4) on the factors, is solved here using non-negative matrix factorization:

    ΓU*, ΓI* = argmin_{ΓU,ΓI} ‖Mtrain ⊙ (RUI − ΓU ΓI^T)‖F²    (3)
               + λU ‖ΓU‖F² + λI ‖ΓI‖F²    (4)

In this equation Mtrain is a binary mask with the same dimensions as matrix RUI, whose entries equal 1 only if the corresponding review is in the training set, ⊙ is the element-wise product and ‖·‖F denotes the Frobenius norm.
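The paper does not detail the optimizer used for this masked factorization beyond naming non-negative matrix factorization. As an illustration only, the following minimal sketch solves eq. (3)-(4) with masked projected-gradient updates in numpy; the function name, learning rate and iteration count are assumptions, not the authors' implementation.

```python
import numpy as np

def masked_nmf(R, M, n_factors=10, lam_u=0.1, lam_i=0.1,
               lr=1e-3, n_iter=200, seed=0):
    """Sketch of eq. (3)-(4): non-negative factorization of the rating
    matrix R under the binary training mask M, with Frobenius (L2)
    penalties on both factors. Hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    Gu = rng.random((n_users, n_factors))   # Gamma_U: one user profile per row
    Gi = rng.random((n_items, n_factors))   # Gamma_I: one item profile per row
    for _ in range(n_iter):
        E = M * (R - Gu @ Gi.T)             # residuals on training cells only
        # gradient steps, clipped at zero to keep the factors non-negative
        Gu = np.maximum(0.0, Gu + lr * (E @ Gi - lam_u * Gu))
        Gi = np.maximum(0.0, Gi + lr * (E.T @ Gu - lam_i * Gi))
    return Gu, Gi
```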
2.2.2 Text profiles exploitation
Let us denote πu the text profile of user u and σt(πu′, πu) a similarity operator between user profiles. The last component of the predictor f in (1) is a weighted average of the user ratings for item i, where the weight σt(πu′, πu) is the similarity between the text profiles πu′ and πu of users u′ and u, the latter being the target user. This term takes into account the fact that two users with similar styles, or using similar expressions in their appreciation of an item, should share close ratings on this item. The prediction term for the user/item couple (u, i) is then expressed as a weighted mean:

    g(u, i) = (1/mtrain(i)) Σ_{Strain(i)} ru′i σt(πu′, πu)    (5)

Two different representations for the text profiles πu of the users are investigated in this article: one is based on a latent representation of the texts obtained by a neural network autoencoder, the other relies on a robust bag-of-words coding. Each one is associated to a dedicated metric σt. This leads to two formulations of g, and thus to two rating prediction models. We denote the former fA (autoencoder) and the latter fT (bag of words). Details are provided below.

Bag of words.
A preprocessing step removes all words appearing in fewer than 10 documents. Then, the 100 000 most frequent words are kept. Although the number of features is large, the representation is sparse and scales well. πu is simply the binary bag of words of all texts of user u. In this high dimensional space, the proximity in style between two users is well described by a cosine function, a high value indicating a similar usage of words:

    σt(πu′, πu) = πu′ · πu / (‖πu′‖ ‖πu‖)    (6)
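To make the text term concrete, here is a minimal sketch of g(u, i) from eq. (5) with the bag-of-words profiles and the cosine similarity of eq. (6). The data layout (a dict mapping each item to its training (author, rating) pairs and a dict of user profile vectors) and all names are illustrative assumptions.

```python
import numpy as np

def cosine(pu_prime, pu):
    """Eq. (6): cosine similarity between two binary bag-of-words profiles."""
    denom = np.linalg.norm(pu_prime) * np.linalg.norm(pu)
    return float(pu_prime @ pu) / denom if denom > 0 else 0.0

def g_text(u, i, train_reviews, profiles, sim=cosine):
    """Eq. (5): mean over the training reviews of item i of the rating given
    by each author u', weighted by the text-profile similarity between u'
    and the target user u. `train_reviews[i]` is assumed to be a list of
    (u_prime, rating) pairs and `profiles[u]` a profile vector."""
    reviews_i = train_reviews.get(i, [])
    if not reviews_i:
        return 0.0
    weighted = [r * sim(profiles[u_prime], profiles[u]) for u_prime, r in reviews_i]
    return sum(weighted) / len(reviews_i)
```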
Autoencoder.
The neural network autoencoder has two components: a coding operator and a decoding operator, denoted respectively cod and dec. The two vectorial operators are learned so as to enable the reconstruction of the original text after a projection in the latent space. Namely, given a sentence suik represented as a binary bag-of-words vector, we obtain a latent profile πsuik = cod(suik) and then reconstruct an approximation of the sentence using ŝuik = dec(πsuik). The autoencoder is optimized so as to minimize the reconstruction error over the training set:

    cod*, dec* = argmin_{cod,dec} Σ_{Strain} (1/κui) Σ_{k=1..κui} ‖suik − dec(cod(suik))‖²    (7)

We use the settings proposed in [6]: our dictionary is obtained after stopword removal, keeping the 5000 most frequent words. We did not use a larger dictionary such as the one used for the bag-of-words representation, since it does not lead to improved performance and simply increases the computational load. All sentences are represented as binary bags of words over this dictionary. The coding dimension has been set to 1000 after a few evaluation trials. Note that the precise value of this latent dimension is not critical: the performance is similar over a large range of dimension values.

Both cod and dec use sigmoid units sig(t) = 1/(1 + exp(−t)):

    cod(suik) = πsuik = sig(W suik + b)    (8)
    dec(πsuik) = sig(W^T πsuik + b′)

Here, πsuik is a vector, W is a weight matrix mapping the 5000-dimensional input to the 1000-dimensional latent space, and sig() is a pointwise sigmoid operator applied to the vector W suik + b.

As motivated in [11, 5], such a latent representation helps exploiting term co-occurrences and thus introduces some semantics; it provides a robust text representation. The hidden activity of this neural network produces a continuous representation for each sentence, accounting for the presence or absence of groups of words.

πu is obtained by coding the vector corresponding to all the text written by user u in the past. It lies in a latent word space where a low Euclidean distance between users means a similar usage of words. Thus, for the similarity σt, we use an inverse Euclidean distance in the latent space:

    σt(πu′, πu) = 1/(α + ‖πu′ − πu‖)    (9)
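A minimal sketch of the tied-weight sigmoid autoencoder of eqs. (7)-(8), trained by plain squared-error gradient descent on binary sentence vectors. The class name, initialization and learning rate are assumptions; the paper only specifies the sigmoid units, the 5000-word input and the 1000-dimensional code.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

class TiedAutoencoder:
    """Sketch of eq. (8): cod(s) = sig(W s + b), dec(p) = sig(W^T p + b'),
    trained to reconstruct binary bag-of-words sentence vectors (eq. (7))."""
    def __init__(self, n_words=5000, n_latent=1000, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(n_latent, n_words))
        self.b = np.zeros(n_latent)        # encoder bias b
        self.b_dec = np.zeros(n_words)     # decoder bias b'

    def cod(self, s):
        return sigmoid(self.W @ s + self.b)

    def dec(self, p):
        return sigmoid(self.W.T @ p + self.b_dec)

    def train_step(self, s, lr=0.1):
        """One squared-error gradient step on a single sentence vector s."""
        p = self.cod(s)
        s_hat = self.dec(p)
        err = s_hat - s                             # d loss / d s_hat (up to a factor 2)
        d_pre_dec = err * s_hat * (1 - s_hat)       # through the decoder sigmoid
        d_p = self.W @ d_pre_dec                    # back to the code
        d_pre_cod = d_p * p * (1 - p)               # through the encoder sigmoid
        # tied weights: encoder and decoder both contribute to the gradient of W
        grad_W = np.outer(d_pre_cod, s) + np.outer(p, d_pre_dec)
        self.W -= lr * grad_W
        self.b -= lr * d_pre_cod
        self.b_dec -= lr * d_pre_dec
        return float(np.sum(err ** 2))
```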
2.2.3 Global training criterion for ratings prediction
In order to connect all the elementary components described above with respect to our recommendation task, we introduce (positive) weighting parameters β in (1). Thus, the initial optimization problem (2) becomes:

    β* = argmin_β (1/mtrain) Σ_{Strain} (rui − (β1 µ* + β2 µu* + β3 µi* + β4 γu*·γi* + β5 g(u, i)))²    (10)

The linear combination is optimized using the validation set: this step guarantees that all components are combined in an optimal manner.
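Since all five components are pre-trained, fitting the positive weights β of eq. (10) reduces to a small non-negative least-squares problem over the held-out ratings. The sketch below uses scipy's nnls; the choice of solver and the data layout are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import nnls

def fit_beta(val_set, mu, mu_u, mu_i, Gu, Gi, g_text):
    """Eq. (10): learn positive weights beta combining the five pre-trained
    components (overall/user/item biases, matrix factorization, text term)
    by least squares on held-out ratings. `val_set` is assumed to be a list
    of (u, i, rating) triplets; mu_u, mu_i, Gu, Gi are indexed by user/item id."""
    A, y = [], []
    for u, i, r in val_set:
        A.append([mu, mu_u[u], mu_i[i], Gu[u] @ Gi[i], g_text(u, i)])
        y.append(r)
    beta, _ = nnls(np.array(A), np.array(y))   # beta >= 0
    return beta

def predict(u, i, beta, mu, mu_u, mu_i, Gu, Gi, g_text):
    """Final hybrid prediction f(u, i) of eq. (1) with the learned weights."""
    x = np.array([mu, mu_u[u], mu_i[i], Gu[u] @ Gi[i], g_text(u, i)])
    return float(beta @ x)
```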
2.3 Text generation model
The goal here is to generate a review text for each (u, i) recommendation. During the recommendation process, this text is an additional piece of information for users to consider. It should catch their interest and, in principle, be close to the one that user u could have written himself on item i. Each text is generated as an extractive summary, where the extracted sentences su′i come from the reviews written by other users (u′ ≠ u) about item i. Sentence selection is performed according to a criterion which combines a similarity between the sentence and the textual user profile, and a similarity between the actual rating ru′i and the prediction r̂ui made for (u, i), computed as described in section 2.2. The former measure could take into account several dimensions like vocabulary, sentiment expression and even style; here it is mainly the vocabulary which is exploited. The latter measures the proximity between user tastes. For the text measure, we make use of the σt similarity introduced in section 2.2; as before, we consider two representations for texts (latent coding and raw bag of words). For the ratings similarity, we use σr(ru′i, r̂ui) = 1/(1 + |ru′i − r̂ui|).

Suppose one wants to select a single sentence for the extracted summary. The sentence selection criterion is then a simple average of the two similarities:

    h(su′i, ru′i, u′, u, i) = (σt(su′i, πu) + σr(ru′i, r̂ui)) / 2    (11)

Note that this function may score any piece of text. In the following, we consider three possibilities for generating text reviews. The first one simply consists in selecting the best sentence su′i among all the training sentences for item i with respect to h; we call it 1S, for single sentence. The second one selects a whole review du′i among all the reviews for i; the document is here considered as one long sentence. This is denoted CT, for complete text. The third one is a greedy procedure that selects multiple sentences; it is denoted XS. It is initialized with 1S, and sentences are then selected under two criteria: relevance with respect to h and diversity with respect to the sentences already selected. Selection is stopped when the length of the text exceeds the average length of the texts of the target user. Algorithm 1 sums up the XS procedure for generating the text d̂ui for the couple (user u, item i).

Algorithm 1: XS greedy procedure: selection of successive sentences to maximize both relevance and diversity. d̂ui is the text that is generated, sentence after sentence.
    Data: u, i, S = {(su′i, ru′i, u′)}
    Result: d̂ui
    s*u′i ← argmax_{su′i ∈ S} h(su′i, ru′i, u′, u, i);
    d̂ui ← s*u′i; remove s*u′i from S;
    while length(d̂ui) < averagelength(u) do
        s*u′i ← argmax_{su′i ∈ S} [h(su′i, ru′i, u′, u, i) − cos(su′i, d̂ui)];
        append s*u′i to d̂ui; remove s*u′i from S;
    end
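Algorithm 1 translates almost directly into code. The sketch below reuses the scoring function h of eq. (11) with a cosine diversity penalty; the representation of candidate sentences as (bag-of-words vector, text, rating) tuples and the use of character length for the stopping criterion are assumptions.

```python
import numpy as np

def cosine(a, b):
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / d if d > 0 else 0.0

def sigma_r(r_other, r_pred):
    """Rating similarity used in eq. (11)."""
    return 1.0 / (1.0 + abs(r_other - r_pred))

def xs_summary(candidates, pi_u, r_pred, avg_len, sigma_t=cosine):
    """Greedy XS procedure (Algorithm 1). `candidates` is a list of
    (vector, text, rating) tuples for sentences written by other users on
    item i; `pi_u` is the target user's text profile, `r_pred` the predicted
    rating r_hat_ui, `avg_len` the average length of the user's past reviews."""
    def h(vec, rating):                                   # eq. (11)
        return 0.5 * (sigma_t(vec, pi_u) + sigma_r(rating, r_pred))

    pool = list(range(len(candidates)))
    best = max(pool, key=lambda j: h(candidates[j][0], candidates[j][2]))
    pool.remove(best)                                     # 1S initialization
    summary_vec = candidates[best][0].copy()
    summary_txt = [candidates[best][1]]
    while pool and len(" ".join(summary_txt)) < avg_len:
        # relevance (h) minus redundancy with the sentences already selected
        best = max(pool, key=lambda j: h(candidates[j][0], candidates[j][2])
                                       - cosine(candidates[j][0], summary_vec))
        pool.remove(best)
        summary_vec = summary_vec + candidates[best][0]
        summary_txt.append(candidates[best][1])
    return " ".join(summary_txt)
```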
2.4 Sentiment prediction model
We show here how polarity information about an item can be estimated by exploiting both the predicted rating and the user's textual profile. Exploiting both information sources improves the sentiment prediction performance compared with a usual text-based sentiment classifier.

Polarity classification is the task of predicting whether a text dui (here a review) is positive or negative. We use the ratings rui as ground truth and follow a standard thresholding procedure [15]: reviews rated 1 or 2 are considered as negative, while items rated 4 or 5 are positive. All texts rated 3 are ignored, as it is unclear whether they are positive or negative: it strongly depends on the rating habits of the user.

For evaluation purposes, we consider two baselines. The first one only uses the rating prediction f(u, i) of our recommender system as a label prediction; this value is thresholded as indicated above. The second one is a classical text sentiment classifier. Denoting by dui the binary bag-of-words representation of a document and cui the binary label associated to the rating rui, it uses a linear SVM s(dui) = dui·w. Note that this is usually a strong baseline for the polarity classification task. Our final classifier combines f(u, i) and s(dui) by solving the following optimization problem:

    w* = argmin_w Σ_{Strain, rui≠3} (1 − (dui·w + f(u, i)) cui)+ + λ‖w‖²    (12)

with (x)+ = x when x is positive and (x)+ = 0 otherwise. In the experimental section, we will also compare the results obtained with the two versions of our rating predictor, fT and fA (cf. section 2.2.2).
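A minimal sketch of the combined classifier of eq. (12): the recommender output f(u, i) is added to the text score inside the hinge loss, and w is optimized here by plain stochastic sub-gradient descent. The training loop is an assumption (the paper's text-only baseline relies on LibLinear), and eq. (12) does not say whether f(u, i) is used raw or recentered; the sketch uses it raw.

```python
import numpy as np

def train_combined(train, lam=1e-4, lr=0.01, n_epochs=10, seed=0):
    """Sub-gradient descent on eq. (12). `train` is assumed to be a list of
    (d, c, f_ui) triplets: binary bag-of-words vector d, polarity label
    c in {-1, +1} (ratings 1-2 vs 4-5, ratings of 3 excluded), and the
    recommender output f_ui for that (user, item) pair."""
    rng = np.random.default_rng(seed)
    w = np.zeros(len(train[0][0]))
    for _ in range(n_epochs):
        for idx in rng.permutation(len(train)):
            d, c, f_ui = train[idx]
            margin = (d @ w + f_ui) * c        # hinge margin of eq. (12)
            grad = 2 * lam * w                 # L2 regularization term
            if margin < 1:                     # hinge is active
                grad -= c * d
            w -= lr * grad
    return w

def predict_polarity(d, f_ui, w):
    """Positive if the combined text + rating score is positive."""
    return 1 if (d @ w + f_ui) > 0 else -1
```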
3. EXPERIMENTS
All three modules (ratings, text, sentiment) are evaluated independently, since there is no global evaluation framework. These individual performances should however provide, together, a quantitative appreciation of the whole system. We use two real-world datasets of user reviews, collected from amazon.com [8] and ratebeer.com [11]. Their characteristics are presented in table 1.

Below, we first present how the datasets are preprocessed (section 3.1). The benefits of incorporating text in the rating prediction of the recommender system are then discussed in section 3.2. The quality of the generated reviews is evaluated and analyzed in section 3.3. Finally, the performance of the sentiment classifier combining text and ratings is described in section 3.4.

3.1 Data preprocessing
Reviews from different websites have different formats (rating scales, multiple ratings, ...). We focus on the global rating and scale it to a 1-to-5 integer range. For titled reviews, the title is considered as the first sentence of the text of the review. Each dataset is randomly split into three parts: training, validation and test, containing respectively 80%, 10% and 10% of the reviews.

As described in section 2.2, two representations of the text are considered, each with a different dictionary:

• for the autoencoder, we have selected the 5000 most frequent words after a stopword removal step; the autoencoder input vector is then a binary vector of dimension 5000.

• for the raw representation, we have selected the 100 000 most frequent words appearing in more than 10 documents (including stopwords) and used a binary vector representation.

For the experiments, we consider several subsets of the databases with different numbers of users and items. Each subset is built by extracting, for a given number of users and items, the most active users and the most commented items. Dataset characteristics are given in table 1.

Table 1: Users, items & review counts for every dataset (training / validation / test review counts).
    Source   | Subset       | #Users  | #Items  | #Training | #Validation | #Test
    RateBeer | U50_I200     | 52      | 200     | 7200      | 900         | 906
    RateBeer | U500_I2k     | 520     | 2000    | 388200    | 48525       | 48533
    RateBeer | U5k_I20k     | 5200    | 20000   | 1887608   | 235951      | 235960
    Amazon   | U200_I120    | 213     | 122     | 984       | 123         | 130
    Amazon   | U2k_I1k      | 2135    | 1225    | 31528     | 3941        | 3946
    Amazon   | U20k_I12k    | 21353   | 12253   | 334256    | 41782       | 41791
    Amazon   | U210k_I120k  | 213536  | 122538  | 1580576   | 197572      | 197574

3.2 Recommender system evaluation
Let us first consider the evaluation of the rating prediction. The metric used here is the mean squared error (MSE) between rating predictions r̂ui and actual ratings rui. The lower the MSE, the better the model estimates the correspondence between user tastes and items. Results are presented in table 2.

Table 2: Test performance (mean squared error) for recommendation. µ, µu, µi are the overall bias, user bias and item bias baselines. γu·γi is the plain matrix factorization baseline. fA, fT are our hybrid recommender systems relying respectively on latent and raw text representations. The datasets are described in table 1.
    Subset          | µ      | µu     | µi     | γu·γi  | fA     | fT
    RB U50_I200     | 0.7476 | 0.7291 | 0.3096 | 0.2832 | 0.2772 | 0.2773
    RB U500_I2k     | 0.6536 | 0.6074 | 0.3359 | 0.3168 | 0.3051 | 0.3051
    RB U5k_I20k     | 0.7559 | 0.6640 | 0.3912 | 0.3555 | 0.3451 | 0.3451
    A U200_I120     | 1.5348 | 2.0523 | 1.6563 | 1.7081 | 1.4665 | 1.4745
    A U2k_I1k       | 1.5316 | 1.4391 | 1.3116 | 1.0927 | 1.0483 | 1.0485
    A U20k_I12k     | 1.4711 | 1.4241 | 1.2849 | 1.0797 | 1.0426 | 1.0426
    A U210k_I120k   | 1.5072 | 2.1154 | 1.5318 | 1.2915 | 1.1671 | 1.1678

The models are referenced using the notations introduced in section 2.2. The first column corresponds to a trivial system which predicts the overall bias µ; the second predicts the user bias µu. Both give poor performance, as expected. The third column corresponds to the item bias baseline µi. It assumes that user taste is not relevant and that each item has its own intrinsic quality. The improvement with respect to µ and µu is important, since the MSE is halved. The fourth column corresponds to a non-negative matrix factorization baseline, denoted γu·γi. It jointly computes latent representations for user tastes and item characteristics. Unsurprisingly, it is our best baseline.

It can be noted that performance tends to degrade when the subset size increases. This is a side effect of the review selection process used to build the different datasets: smaller subsets contain the most active users and the most commented items, and the estimation of their profiles benefits from the high number of reviews per user (and per item) in this context.

The last two columns refer to our hybrid recommender systems, using the two text representations introduced in section 2.2. Both fA (autoencoder) and fT (raw text) perform better than the baseline collaborative filtering system, and both have similar approximation errors. The main difference between the systems comes from the complexity of the approach: during the learning step, fT is much faster than fA, since no autoencoder optimization is required. On top of that, fT remains faster in the inference step: the inherent sparsity of the bag-of-words representation enables fT to provide faster computations than fA, whose latent space is smaller but not sparse.

3.3 Text generation evaluation
We move on now to the evaluation of the personalized review text generation module. Since we are using an extractive summary procedure, we make use of a classical loss for summarization systems: a recall-oriented ROUGE-n metric, comparing the generated text against the actual text of the review produced by the user. As far as we know, generating candidate reviews has never been dealt with in this context: this is a novel task. The ROUGE-n metric is the proportion of n-grams of the actual text found in the predicted (candidate) text; we use n = {1, 2, 3}. The higher ROUGE-n is, the better the quality of the candidate text. A good ROUGE-1 means that topics or vocabulary are correctly caught, while ROUGE-2 and ROUGE-3 are more representative of the user's style.
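For reference, the recall-oriented ROUGE-n used here amounts to counting how many of the reference review's n-grams also appear in the candidate text. A minimal single-reference sketch, assuming whitespace tokenization and clipped counts:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Proportion of the reference n-grams found in the candidate text
    (clipped counts), i.e. the recall-oriented ROUGE-n of section 3.3."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / sum(ref.values())
```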
A first baseline is given by using a random scoring function h (instead of the formulation given in (11)); it provides a lower bound of the performance. Three oracles are then used to provide an upper bound on the performance: they directly optimize the ROUGE-n metrics from the data on the test set. A matrix factorization baseline is also used. It is a special case of our model where no text information is used: it computes the same score for all the sentences of a given user relative to an item. When one sentence only is selected, it is taken at random among the sentences of this user for the item; with greedy selection, the first sentence is chosen at random and then the cosine diversity term (algorithm 1) allows a ranking of the next candidate sentences. Our proposed method is evaluated with the two user profile representations πu (autoencoder and raw text). The performance of these seven models on the two biggest datasets, with respect to the three metrics, is aggregated in figure 2.

[Figure 2: Histograms of the performance of the summarizer on the two biggest datasets (RB U5k_I20k and A U210k_I120k). ROUGE-1 scores are shown in blue, ROUGE-2 and ROUGE-3 scores in yellow and black. Seven models are compared: random, the three ROUGE-n oracles, the NMF-based model, and the fA and fT based models, in three frameworks: CT (review extraction), 1S (one-sentence extraction), XS (multiple-sentence extraction).]

Each histogram corresponds to a text selection entity (whole review text, best single sentence, greedy sentence selection). Groups in the histograms are composed of three cells corresponding respectively to the ROUGE-1, -2 and -3 metrics. Not surprisingly, the results for the single sentence selection procedure (1S) are always worse than for the other two (CT: complete review, and XS: multiple sentences). This is simply because a sentence contains fewer words than a full review and can hardly share more n-grams with the reference text than the full text does.

For the ratebeer.com datasets, selecting a set of sentences clearly offers better performance than selecting a whole review in all cases. Texts written to describe beers also describe the tasting experience. Was it in a bar or at home? Was it a bottle or on tap? Texts of this community share the same structure and vocabulary to describe both the tasting and the flavors of the beer, and most users write short and precise sentences. This is an appropriate context for our sentence scoring model, where the habits of users are caught by our recommender systems. The performance slightly decreases when the size of the dataset is increased; as before, this is in accordance with the selection procedure of these datasets, which focuses first on the most active users and the most commented items. For Amazon, the conclusion is not so clear and, depending on the conditions, either whole reviews or selected sentences get the best score. This is linked to the higher variety in the community of users on this website: well structured sentences like those present in RateBeer are here mixed with different levels of English and troll reviews.

Overall, the different models follow a clear hierarchy. First, stating the obvious, the random model has the worst performance. Then, using a recommender system to select relevant sentences helps in terms of ROUGE-n performance. Using the text information brings, most of the time, only a small score improvement: overall our models only offer small improvements here with respect to random or NMF text selection (i.e. selection based on rating similarity only). After analyzing this behavior, we believe that this is due to the shortness of the text reviews, to their relatively standardized form (arguments are very similar from one review to another), to the peaked vocabulary distribution of the reviews, and to the nature of ROUGE. The latter is a classical recall-oriented summarization evaluation measure, but it does not distinguish well between text candidates in this context. This also shows that there is room for improvement on this aspect.

Concerning the oracles, several conclusions can be drawn. For both single sentence and complete text selection, the gap between the ROUGE oracles and the proposed selection method is important, suggesting that there is still room for improvement here too. For the greedy sentence selection, the gap between the oracles and the hybrid recommender systems is moderate, suggesting that the procedure is here fully efficient. However, this conclusion should be moderated: whereas ROUGE is effectively an upper bound for single sentence or whole review selection, this is no longer the case for multiple sentence selection. Because of the complexity of selecting the best subset of sentences according to a loss criterion (which amounts to a combinatorial selection problem), we have used a sub-optimal forward selection procedure: we first select the best ROUGE sentence, then the second best, etc. In this case the ROUGE oracle is no longer optimal.

Concerning the measures, the performance decreases rapidly when we move from ROUGE-1 to ROUGE-2, 3. Given the problem formulation and the context of short product reviews, ROUGE-2 and ROUGE-3 are clearly too constraining and the corresponding scores are not significant.

3.4 Sentiment classification evaluation
The performance of the different models, using the sentiment classification error as evaluation metric, is presented in table 3. Because they give very poor performance, the bias recommendation models (µ and µu) are not presented here.

Table 3: Test performance (classification error) as polarity classifiers. LL stands for LibLinear (SVM); µi, γu·γi, fA, fT are the recommender systems of table 2. LL + fA and LL + fT are the hybrid opinion classification models combining the SVM classifier with the fA and fT recommender systems.
    Subset          | LL    | µi    | γu·γi | fA    | fT    | LL + fA | LL + fT
    RB U50_I200     | 5.35  | 5.12  | 6.01  | 5.57  | 5.57  | 3.79    | 3.79
    RB U500_I2k     | 7.18  | 10.67 | 9.73  | 8.55  | 8.55  | 6.52    | 6.92
    RB U5k_I20k     | 8.44  | 11.80 | 10.04 | 9.17  | 9.17  | 8.33    | 8.35
    A U200_I120     | 10.00 | 15.83 | 22.50 | 20.00 | 20.83 | 10.00   | 10.00
    A U2k_I1k       | 7.89  | 15.25 | 12.85 | 12.62 | 12.62 | 7.54    | 7.54
    A U20k_I12k     | 6.34  | 13.99 | 12.79 | 12.38 | 12.37 | 6.29    | 6.29
    A U210k_I120k   | 6.25  | 14.04 | 14.40 | 13.32 | 13.31 | 6.22    | 6.22

The item bias µi, second column, gives a baseline, which is improved by the matrix factorization γu·γi, third column. Our hybrid models fA, fourth column, and fT, fifth column, have lower classification errors than all the other recommender systems. The first column, LL, is the linear support vector machine (SVM) baseline. It has been learnt on the training set texts, the regularization hyperparameter has been selected using the validation set, and our implementation relies on LibLinear [4]. Its performance is better than that of the recommender systems, but it should be noted that it makes use of the actual text dui of the review, whereas the recommender systems only use past information regarding user u and item i. Note that, even in this context, the recommender performance on RateBeer is very close to the SVM baseline.

It is then possible to combine the two models, according to the formulation proposed in section 2.4. The resulting hybrid approaches, denoted LL + fA and LL + fT, exploit both the text-based decision (SVM) and the user profile (fA and fT). This combined model shows good classification performance and overcomes the LL baseline in 4 out of 7 experiments in table 3, while performing similarly to LL in the other 3 experiments.
4. OVERALL GAINS
In order to get a global vision of the overall gain provided by the proposed approach, we summarize here the results obtained on the different tasks. For each task, the gain with respect to the (task dependent) baseline is computed and averaged (per task) over all datasets; the metric depends on the task. Results are presented in figure 3.

[Figure 3: Aggregated gains on the 3 tasks w.r.t. classic baselines: our hybrid recommender systems are better overall. (a) Recommender systems, gain in % w.r.t. the MSE of γu·γi (baseline = matrix factorization). (b) Summarizers, gain in % w.r.t. random on ROUGE-n (baseline = random selection procedure). (c) Opinion classifiers, gain in % w.r.t. the good classification rate of LL (baseline = SVM).]

For the mean squared error metric (figure 3a), the matrix factorization is used as baseline. The user bias µu heavily fails to generalize on two datasets. The item bias is closer to the baseline (−11.43%). Our hybrid models, which use the texts to refine user and item profiles, bring a gain of 5.71% for fA and 5.63% for fT. This demonstrates the interest of including textual information in the recommender system. The autoencoder and raw text approaches offer similar gains, the latter approach being overall faster.

For the text generation, we take the random model as baseline; results are presented in figure 3b. The gain is computed for the three investigated frameworks (CT: review selection, 1S: one sentence selection, XS: multiple sentence selection) and per measure (ROUGE-1, 2, 3), and then averaged into one overall gain. ROUGE-n oracles clearly outperform the other models, which is intuitive. The different recommender systems have very close behaviors, with respective gains of 11.15% (matrix factorization), 11.89% (autoencoder) and 11.83% (raw text). Here textual information helps but does not clearly dominate ratings, providing only a small improvement. Remember that, although the performance improvement with respect to baselines is desirable, the main novelty of the approach is to propose a personalized summary generation together with the usual rating prediction.

For the opinion classifier, presented in figure 3c, the baseline is a linear SVM. Basic recommender systems perform poorly with respect to this baseline (LL). Surprisingly, the item bias µi (−68.71%) performs slightly better than matrix factorization γu·γi (−69.54%) in this context of sentiment classification (no neutral reviews and binary ratings). Using textual information increases the performance: the autoencoder based model fA (−57.17%) and the raw text approach fT (−58.31%) perform similarly. As discussed in section 3.4, the linear SVM uses the text of the current review whereas the recommender systems do not. As a consequence, it is worth combining both predictions in order to exploit text and past profiles: the resulting models give respective gains of 4.72% (autoencoder) and 3.89% (raw text) w.r.t. the SVM.
5. RELATED WORK
Since the paper covers the topics of rating prediction, summarization and sentiment classification, we briefly present each of them.

5.1 Recommender systems
Three main families of recommendation algorithms have been developed [3]: content-based, knowledge-based and collaborative filtering. Given the focus of this work on consumer reviews, we considered collaborative filtering. For merchant websites the goal is to encourage users to buy new products, and the problem is usually considered either as the prediction of a ranked list of relevant items for each user [13] or as the completion of missing ratings [9]. We have focused here on the latter approach for evaluation concerns, since we use data collected from third party sources.

5.2 Text summarization for consumer reviews
Early reference work [7] on consumer reviews has focused on the global summarization of user reviews for each item. The motivation of this work was to extract the sentiments associated to a list of features from all the item review texts; the summarization took the form of a rating or of an appreciation of each feature. Here, contrarily to this line of work, the focus is on personalized item summaries for a target user. Given the difficulty of producing a comprehensive synthetic summary, we have turned this problem into a sentence or text selection process.

Evaluation of summaries is challenging: how to assess the quality of a summary when the ground truth is subjective? In our context, the review texts are available and we use them as the ground truth, with the classical ROUGE-n summary evaluation measures [10].

5.3 Sentiment classification
Different latent text representations have been proposed in this scope: [14] proposed a generative model to jointly represent topics and sentiments and, recently, several works have considered matrix factorization or neural networks in an attempt to develop robust sentiment recognition systems [6]. [16] go further and propose to learn two types of representation: a vectorial model is learned for word representation together with a latent transformation model, which allows the representation of negation and quantifiers associated to an expression. We have investigated two kinds of representation for the texts: bag of words and a latent representation through the use of autoencoders as in [6]. [11] also uses a latent representation of reviews, although in a probabilistic setting instead of a deterministic one like ours.

For a long time, sentiment classification has ignored the user dimension and has focused, for example, on the conception of "universal" sentiment classifiers able to deal with a large variety of topics [15]. Considering the user has become an issue only very recently. [18], for example, exploited explicit relations in social graphs for improving opinion classifiers, but their work is only focused on this aspect. [12] proposed to distinguish different rating behaviors and showed that modeling the review authors on a scale ranging from connoisseur to expert offers a significant gain for an opinion prediction task.

In our work, we have experimented with the benefits of considering the text of user reviews in recommender systems, for their performance as sentiment classifiers. We have additionally proposed, as a secondary contribution, an original model mixing recommender systems and linear classification.

5.4 Hybrid approaches
In the field of recommendation, a first hybrid model was proposed by [5]: it is based on hand labeling of review sentences (topic and polarity) to identify relevant characteristics of the items. [11] pushes the exploitation of texts further, by using a joint latent representation for ratings and textual content with the objective of improving the rating accuracy. These two works are focused on rating prediction and do not consider delivering additional information to the user. Very recently, [19] has considered adding an explanation component to a recommender system: they propose to extract some keywords from the review texts, which are supposed to explain why a user likes or dislikes an item. This is probably the work whose spirit is closest to ours, but they do not provide a quantitative evaluation.

[7] combined opinion mining and text summarization on product reviews with the goal of extracting the qualities and defects of an item. [17] proposed a system for delivering personalized answers to user queries on specific products; they built the user profiles relying on topic modeling, without any sentiment dimension. [1] proposed a personalized news recommendation algorithm evaluated on the Yahoo portal using user feedback, but it does not investigate rating or summarization issues. Overall, we propose in this article to go beyond a generic summary of item characteristics by generating, for each user, a personalized summary that is close to what they would have written about the item themselves.
6. CONCLUSION
This article proposes an extended framework for the recommendation task. The general goal is to enrich classical recommender systems with several dimensions. As an example, we show how to generate personalized reviews for each recommendation using extracted summaries; this is our main contribution. We also show how ratings and text can be used to produce efficient personalized sentiment classifiers for each recommendation. Depending on the application, other additional information could be brought to the user.

Besides producing additional information for the user, the different information sources can benefit from one another. We thus show how to effectively make use of text reviews and rating information for building improved rating predictors and review summaries. As already mentioned, the sentiment classifier also benefits from the two information sources. This part of the work demonstrates that multiple information sources can be useful for improving recommendation systems. This is particularly interesting since several such sources are now effectively available at many online sites, and several new applications could be developed along the lines suggested here. From a modeling point of view, more sophisticated approaches can be developed: we are currently working on a multitask framework where the representations used in the different components are more closely correlated than in the present model.

Acknowledgements
The authors would like to thank the AMMICO project (F1302017 Q - FUI AAP 13) for funding our research.
7. REFERENCES
[1] D. Agarwal, B.-C. Chen, and B. Pang. Personalized recommendation of user comments via factor models. EMNLP'11, 2011.
[2] M. Amini and N. Usunier. A contextual query expansion approach by term clustering for robust text summarization. DUC'07, 2007.
[3] R. Burke. Hybrid recommender systems: Survey and experiments. UMUAI'02, 2002.
[4] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR'08, 2008.
[5] G. Ganu, N. Elhadad, and A. Marian. Beyond the stars: Improving rating predictions using review text content. WebDB'09, 2009.
[6] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML'11, 2011.
[7] M. Hu and B. Liu. Mining and summarizing customer reviews. KDD'04, page 168, 2004.
[8] N. Jindal and B. Liu. Opinion spam and analysis. In WSDM, pages 219-230. ACM, 2008.
[9] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, pages 42-49, 2009.
[10] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In ACL Workshop: Text Summarization Branches Out, 2004.
[11] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys'13, 2013.
[12] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW'13, 2013.
[13] M. R. McLaughlin and J. L. Herlocker. A collaborative filtering algorithm and evaluation metric that accurately model the user experience. In SIGIR'04, 2004.
[14] Q. Mei, X. Ling, M. Wondra, H. Su, and C. X. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In WWW. ACM, 2007.
[15] B. Pang and L. Lee. Opinion mining and sentiment analysis. Information Retrieval, 2008.
[16] R. Socher, B. Huval, C. D. Manning, and A. Ng. Semantic compositionality through recursive matrix-vector spaces. In EMNLP'12. ACL, 2012.
[17] C. Tan, E. Gabrilovich, and B. Pang. To each his own: personalized content selection based on text comprehensibility. In ICWDM'12. ACM, 2012.
[18] C. Tan, L. Lee, J. Tang, L. Jiang, M. Zhou, and P. Li. User-level sentiment analysis incorporating social networks. In KDD'11. ACM, 2011.
[19] Y. Zhang, G. Lai, M. Zhang, Y. Zhang, Y. Liu, and S. Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. 2014.