<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Extended Recommendation Framework: Generating the Text of a User Review as a Personalized Summary</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mickaël Poussevin</string-name>
          <email>mickael.poussevin@lip6.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Guigue</string-name>
          <email>vincent.guigue@lip6.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Gallinari</string-name>
          <email>patrick.gallinari@lip6.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS</institution>
          ,
          <addr-line>4 Place Jussieu, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS</institution>
          ,
          <addr-line>4 Place Jussieu, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sorbonne-Universités UPMC, LIP6 UMR 7606 CNRS</institution>
          ,
          <addr-line>4 Place Jussieu, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>20</volume>
      <issue>2015</issue>
      <abstract>
        <p>We propose to augment rating-based recommender systems by providing the user with additional information that might help them in their choice or in understanding the recommendation. We consider as a new task the generation of personalized reviews associated to items, and we use an extractive summary formulation for generating these reviews. We also show that the two information sources, ratings and review texts, can be used both for estimating ratings and for generating summaries, leading to improved performance for each system compared to the use of a single source. Besides these two contributions, we show how a personalized polarity classifier can integrate the rating and textual aspects. Overall, the proposed system offers the user three personalized hints for a recommendation: rating, text and polarity. We evaluate these three components on two datasets using appropriate measures for each task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The emergence of the participative web has enabled users
to easily share their sentiments on many different topics. This
opinionated data flow grows rapidly and offers
opportunities for several applications such as e-reputation
management or recommendation. Today, many e-commerce
websites present each item available on their platform with a
description of its characteristics, its average appreciation and
ratings, together with individual user reviews explaining those
ratings.</p>
      <p>
        Our focus here is on user-item recommendation. This is a
multifaceted task where different information sources about
users and items can be considered and different kinds of
recommendation information can be provided to the user. Despite
this diversity, the academic literature on recommender
systems has focused on only a few specific tasks. The most
popular one is certainly the prediction of user preferences
given their past rating profile. These systems typically rely
on collaborative filtering [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to predict missing values in a
user/item/rating matrix.
      </p>
      <p>[Figure 1: System overview. Inputs are user reviews (text and rating). Classic recommender systems compute latent profiles for rating prediction; user and item text profiles additionally feed improved rating predictions and personalized review summaries.]</p>
      <p>
        In this perspective of rating prediction, some authors have made use of additional
information sources available on typical e-commerce sites. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
proposed to extract topics from consumer reviews in order to
improve rating predictions. Recently, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed to learn
a latent space common to both textual reviews and product
ratings; they showed that rating prediction was improved
by such hybrid recommender systems. Concerning the
information provided to the user, some models exploit review
texts for ranking comments that users may like [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or for
answering specific user queries [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>
        We start here from the perspective of predicting user
preferences and argue that exploiting the information
present in many e-commerce sites allows us to go beyond
simple rating prediction and present users with
complementary information that may help them make their choices.
As an example, we consider the generation of a personalized
review accompanying each item recommendation. Such a
review is a source of complementary evidence for the user's
appreciation of a suggestion. As is done for the
ratings, we exploit past information and user similarity in
order to generate these reviews. Since pure text generation
is a very challenging task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we adopt an extractive
summary perspective: the generated text accompanying each
rating will be extracted from the reviews of selected users
who share similar tastes and appreciations with the target
user. Since ratings and reviews are correlated, this correlation can
also be exploited to improve the predictions. Our rating
predictor will make use of user textual profiles extracted from
their reviews, and summary extraction will in turn use
predicted ratings. Thus both types of information, predicted
ratings and generated text reviews, are offered to the user,
and each prediction, rating and generated text, takes into
account both sources of information. Additional
information could also be provided to the user. We show here,
as an example, that predicted ratings and review texts can
be used to train a robust sentiment classifier which provides
the user with a personalized polarity indication about the
item. The modules of our system are evaluated on the two
main tasks, rating prediction and summary extraction, and
on the secondary task of sentiment prediction. For this,
experiments are conducted on real datasets collected from
amazon.com and ratebeer.com, and the models are compared to
classical baselines.
      </p>
      <p>The recommender system is compared to a classic
collaborative filtering model using the mean squared error metric.
We show that using both ratings and user textual profiles
allows us to improve the performance of a baseline
recommender. The gains stem from a more precise
understanding of the key aspects and opinions included in the
item and user textual profiles. For evaluating the summary text
generated for a couple (user, item), we have at
our disposal a gold standard: the very review text written
by this user on the item. Note that this is a rare situation
in summary evaluation. However, contrary to collaborative
filtering, there is no consensual baseline. We therefore compare
our results to a random model and to oracles optimizing the
ROUGE-n metric; they respectively provide a lower and
an upper bound on the attainable performance. The
sentiment classifier is classically evaluated using classification
accuracy.</p>
      <p>This article is organized as follows. The hybrid
formulation, the review generator and the sentiment classifier are
presented in section 2. Then, section 3 gives an extensive
experimental evaluation of the framework. The overall gains
associated to hybrid models are discussed in section 4. A
review of related work is provided in section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. MODELS</title>
      <p>
        In this section, after introducing the notations used
throughout the paper, we describe successively the three
modules of our system. We start by considering the prediction
of ratings [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Rating predictors answer the following
question: what rating will this user give to this item? We present
a simple and efficient way to introduce text profiles
representing the writing style and tastes of the user in a hybrid
formulation. We then show how to exploit reviews and
ratings in a new and challenging task: what text will this user write
about this item? We propose an extractive summary
formulation of this task. We then describe how both
ratings and text can be used together in a personalized
sentiment classifier.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Notations</title>
      <p>We use u (respectively i) to refer to everything related to
a user (respectively to an item), and the rating given by user
u to item i is denoted rui. U and I refer to anything
relative to all users and all items, such as the rating matrix
RUI. Similarly, lower case letters are used for scalars or
vectors and upper case letters for matrices. dui is the actual
review text written by user u for item i. It is composed of
ℓui sentences: dui = {suik; 1 ≤ k ≤ ℓui}. In this work,
we consider documents as bags of sentences. To simplify
notations, suik is replaced by sui when there is no
ambiguity. Thus, user appreciations are quadruplets (u, i, rui, dui).
Recommender systems use past information to compute a
rating prediction r̂ui; the corresponding prediction function
is denoted f(u, i).</p>
      <p>For the experiments, ratings and text reviews are split
into training, validation and test sets, respectively denoted
Strain, Sval and Stest, containing mtrain, mval and mtest
user appreciations (text and rating). We denote Strain(u) the
subset of all reviews in Strain that were written by user u, and
mtrain(u) the number of such reviews. Similarly, Strain(i) and
mtrain(i) are used for the reviews on item i.</p>
      <p>2.2 Hybrid recommender system with text profiles</p>
      <p>Recommender systems classically use the rating history to
predict the rating r̂ui that user u will give to item i. The
hybrid system described here makes use of both collaborative
filtering through matrix factorization and textual
information to produce a rating, as described in (1):
f(u, i) = μ + βu + βi + γu·γi + g(u, i)   (1)</p>
      <p>The first three predictors in equation (1) are biases
(overall bias, user bias and item bias). The fourth predictor is
a classical matrix factorization term. The novelty of our
model comes from the fifth term of (1), which takes into account
text profiles to refine the prediction f. Our aim for the
rating prediction is to minimize the following empirical loss
function:</p>
      <p>argmin over μ, βu, βi, γu, γi, g of
L = (1/mtrain) Σ_Strain (rui − f(u, i))²   (2)
To simplify the learning procedure, we first optimize the
parameters of the different components independently, as
described in the following subsections. Then we fine-tune the
combination of these components by learning weighting
coefficients so as to optimize the performance criterion (2) on
the validation set.</p>
      <sec id="sec-3-1">
        <title>2.2.1 Matrix factorization</title>
        <p>We first compute the different biases of eq. (1) as the
averaged ratings over their respective domains (overall, user
and item). For the matrix factorization term, we
approximate the rating matrix RUI using two latent factors: RUI ≈
ΓU ΓIᵀ. Both ΓU and ΓI are matrices representing
collections of latent profiles, with one profile per row. We denote
γu (resp. γi) the row of ΓU (resp. ΓI) corresponding to the
latent profile of user u (resp. item i).</p>
        <p>The profiles are learned by minimizing, on the training set,
the mean squared error between the known ratings in matrix
RUI and the approximation provided by the factorization
ΓU ΓIᵀ. This minimization problem, described in equation
(3) with an additional L2 constraint (4) on the factors, is
solved here using non-negative matrix factorization:
ΓU, ΓI = argmin ‖Mtrain ⊙ (RUI − ΓU ΓIᵀ)‖²F + λU ‖ΓU‖²F + λI ‖ΓI‖²F   (3)</p>
        <p>In this equation, Mtrain is a binary mask with the
same dimensions as matrix RUI, whose entries are 1 only if the
corresponding review is in the training set, ⊙ is the
elementwise product and ‖·‖F denotes the Frobenius norm.</p>
        <p>Let us denote θu the text profile of user u and t(θu′, θu)
a similarity operator between user profiles. The last
component of the predictor f in (1) is a weighted average of user
ratings for item i, where the weight t(θu′, θu) is the similarity
between the text profiles θu′ and θu of users u′ and u, the
latter being the target user. This term takes into account
the fact that two users with similar styles, or using similar
expressions in their appreciation of an item, should give close
ratings to this item. The prediction term for the user/item
couple (u, i) is then expressed as a weighted mean:
g(u, i) = (1/mtrain(i)) Σ_Strain(i) ru′i · t(θu′, θu)   (5)</p>
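        <p>A minimal sketch of the weighted mean of eq. (5), assuming hypothetical data structures (a list of training triplets and a dictionary of text profiles; the similarity t is passed as a callable):</p>
        <preformat>
```python
def g_text(u, i, train_reviews, profiles, t):
    """Text-based prediction term of eq. (5): mean of the ratings given
    to item i by other users, each weighted by the text-profile
    similarity t between that user and the target user u.
    train_reviews: list of (user, item, rating) triplets.
    profiles: dict mapping user -> text profile."""
    others = [(u2, r) for (u2, i2, r) in train_reviews if i2 == i]
    if not others:
        return 0.0
    # normalize by the number of training reviews on item i, as in eq. (5)
    total = sum(t(profiles[u2], profiles[u]) * r for (u2, r) in others)
    return total / len(others)
```
        </preformat>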
        <p>Two different representations for the text profiles θu of
the users are investigated in this article: one is based on
a latent representation of the texts obtained by a neural
network autoencoder, the other relies on a robust bag-of-words
coding. Each one is associated with a dedicated similarity metric
t.</p>
        <p>This leads to two formulations of g, and thus to two rating
prediction models. We denote the former fA (autoencoder)
and the latter fT (bag of words). Details are provided below.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Bag of words.</title>
        <p>A preprocessing step removes all words appearing in fewer
than 10 documents. Then, the 100 000 most frequent words
are kept. Although the number of features is large, the
representation is sparse and scales well. θu is simply the binary
bag of words of all texts of user u. In this high-dimensional
space, the proximity in style between two users is well
described by a cosine function; a high value indicates similar
word usage:
t(θu′, θu) = θu′·θu / (‖θu′‖ ‖θu‖)   (6)</p>
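        <p>For binary bag-of-words vectors, the cosine of eq. (6) reduces to a set computation, which exploits the sparsity of the representation. A sketch (representing each binary profile sparsely as a set of word ids, a hypothetical but common choice):</p>
        <preformat>
```python
import math

def cosine(bow_a, bow_b):
    """Cosine similarity of eq. (6) between two binary bag-of-words
    profiles represented as sets of word ids: for binary vectors, the
    dot product is the size of the intersection and the norm is the
    square root of the set size."""
    if not bow_a or not bow_b:
        return 0.0
    return len(bow_a & bow_b) / math.sqrt(len(bow_a) * len(bow_b))
```
        </preformat>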
      </sec>
      <sec id="sec-3-3">
        <title>Autoencoder.</title>
        <p>The neural network autoencoder has two components: a
coding operator and a decoding operator, denoted
respectively cod and dec. The two vectorial operators are learned
so as to enable the reconstruction of the original text after
a projection into the latent space. Namely, given a sentence
suik represented as a binary bag-of-words vector, we obtain
a latent profile θsuik = cod(suik) and then reconstruct
an approximation of the sentence as ŝuik = dec(θsuik).</p>
        <p>The autoencoder is optimized so as to minimize the
reconstruction error over the training set:
cod*, dec* = argmin_{cod, dec} Σ_Strain (1/ℓui) Σ_{k=1..ℓui} ‖suik − dec(cod(suik))‖²   (7)</p>
        <p>
          We use the settings proposed in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]: our dictionary is
obtained after stopword removal and selection of the 5000 most
frequent words. We did not use a larger dictionary such as
the one used for the bag-of-words representation, since it does
not lead to improved performance and simply increases the
computational load. All sentences are represented as binary
bags of words over this dictionary. The coding dimension
has been set to 1000 after a few evaluation trials. Note that
the precise value of this latent dimension is not critical, and
performance is similar over a large range of dimension values.
Both cod and dec use sigmoid units sig(t) = 1 / (1 + exp(−t)):
cod(suik) = θuik = sig(W suik + b)
dec(θuik) = sig(Wᵀ θuik + b′)   (8)
        </p>
        <p>Here, θuik is a vector, W is a 5000×1000 weight matrix
and sig(·) is a pointwise sigmoid operator acting on the
vector W suik + b.</p>
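        <p>A sketch of the tied-weight operators of eq. (8) (not the authors' implementation; for convenience, W is stored here as a latent × vocabulary matrix so that cod computes W s + b directly):</p>
        <preformat>
```python
import numpy as np

def sig(t):
    """Pointwise sigmoid unit sig(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def cod(s, W, b):
    """Coding operator of eq. (8): latent profile of a sentence s
    given as a binary bag-of-words vector."""
    return sig(W @ s + b)

def dec(z, W, b2):
    """Decoding operator of eq. (8), using the tied weights W^T to
    reconstruct an approximation of the bag-of-words vector."""
    return sig(W.T @ z + b2)
```
        </preformat>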
        <p>
          As motivated in [
          <xref ref-type="bibr" rid="ref11 ref5">11, 5</xref>
          ], such a latent representation helps
exploit term co-occurrences and thus introduces some
semantics, providing a robust text representation. The hidden
activity of this neural network produces a continuous
representation for each sentence, accounting for the presence or
absence of groups of words.
        </p>
        <p>θu is obtained by coding the vector corresponding to all
the text written by user u in the past. It lies in a latent
word space where a low Euclidean distance between users
means a similar usage of words. Thus, for the similarity t,
we use an inverse Euclidean distance in the latent space:
t(θu′, θu) = 1 / (ε + ‖θu′ − θu‖)   (9)</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.2.3 Global training criterion for rating prediction</title>
        <p>In order to connect all the elementary components
described above with respect to our recommendation task, we
introduce (positive) weighting parameters λ in (1). Thus,
the initial optimization problem (2) becomes:
λ* = argmin_λ (1/mtrain) Σ_Strain (rui − (λ1 μ + λ2 βu + λ3 βi + λ4 γu·γi + λ5 g(u, i)))²   (10)</p>
        <p>The linear combination is optimized using a validation set:
this step guarantees that all components are combined in an
optimal manner.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.3 Text generation model</title>
      <p>The goal here is to generate a review text for each (u, i)
recommendation. During the recommendation process, this
text is additional information for users to consider. It
should catch their interest and, in principle, be close to the
text that user u could have written himself on item i. Each
text is generated as an extractive summary, where the
extracted sentences su′i come from the reviews written by
other users (u′ ≠ u) about item i. Sentence selection is
performed according to a criterion which combines a
similarity between the sentence and the textual user profile and
a similarity between the actual rating ru′i and the
prediction r̂ui made for (u, i), computed as described in section 2.2.
The former measure could take into account several
dimensions, like vocabulary, sentiment expression and even style;
here it is mainly the vocabulary which is exploited. The
latter measures the proximity between user tastes. For the
text measure, we make use of the similarity t introduced
in section 2.2. As before, we consider two
representations for texts (latent coding and raw bag of words). For the
rating similarity, we use r(ru′i, rui) = 1/(1 + |ru′i − rui|).</p>
      <p>Suppose one wants to select a single sentence for the
extracted summary. The sentence selection criterion is then
a simple average of the two similarities:
h(su′i, ru′i, θu′, θu, i) = (t(su′i, θu) + r(ru′i, r̂ui)) / 2   (11)</p>
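      <p>A minimal sketch of the score of eq. (11), assuming the text similarity t(su′i, θu) has already been computed and is passed as a float:</p>
      <preformat>
```python
def h(text_sim, r_other, r_pred):
    """Sentence score of eq. (11): average of the text similarity
    t(s_u'i, theta_u) and the rating similarity
    r(r_u'i, r_hat_ui) = 1 / (1 + |r_u'i - r_hat_ui|)."""
    rating_sim = 1.0 / (1.0 + abs(r_other - r_pred))
    return (text_sim + rating_sim) / 2.0
```
      </preformat>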
      <p>Note that this function may score any piece of text. In
the following, we consider three possibilities for
generating text reviews. The first one simply consists in selecting
the best sentence su′i among all the training sentences for
item i with respect to h; we call it 1S, for single sentence.
The second one selects a whole review du′i among all the
reviews for i; the document is here considered as one long
sentence. This is denoted CT, for complete text. The third
one is a greedy procedure that selects multiple sentences; it
is denoted XS. It is initialized with 1S, and then sentences
are selected under two criteria: relevance with respect to h,
and diversity with respect to the sentences already selected.
Selection is stopped when the length of the text is greater
than the average length of the texts of the target user.
Algorithm 1 sums up the XS procedure for generating the text
d̂ui for the couple (user u, item i).</p>
      <p>Data: u, i, S = {(su′i, ru′i), u′}
Result: d̂ui
su′i ← argmax_{su′i ∈ S} h(su′i, ru′i, θu′, θu, i);
d̂ui ← su′i;
Remove su′i from S;
while length(d̂ui) &lt; averagelength(u) do
  su′i ← argmax_{su′i ∈ S} h(su′i, ru′i, θu′, θu, i) − cos(su′i, d̂ui);
  d̂ui ← d̂ui + su′i;
  Remove su′i from S;
end
Algorithm 1: XS greedy procedure: selection of successive
sentences to maximize both relevance and diversity. d̂ui is
the text that is generated, sentence after sentence.</p>
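      <p>A sketch of the XS greedy selection (hypothetical data structures; base scores h are precomputed and passed with each candidate, length is counted in words, and the diversity penalty against the current summary is passed as a callable standing in for the cosine term of Algorithm 1):</p>
      <preformat>
```python
def extract_summary(candidates, diversity, max_len):
    """Greedy XS procedure (Algorithm 1 sketch): start with the best
    sentence (1S), then repeatedly pick the unused sentence maximizing
    base score minus a diversity penalty against the current summary,
    stopping once the summary reaches max_len words.
    candidates: list of (sentence, base_score) pairs."""
    pool = dict(candidates)
    best = max(pool, key=pool.get)  # 1S initialization
    summary = [best]
    del pool[best]
    while pool and sum(len(s.split()) for s in summary) < max_len:
        joined = " ".join(summary)
        best = max(pool, key=lambda s: pool[s] - diversity(s, joined))
        summary.append(best)
        del pool[best]
    return " ".join(summary)
```
      </preformat>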
    </sec>
    <sec id="sec-5">
      <title>2.4 Sentiment prediction model</title>
      <p>We show here how polarity information about an item can
be estimated by exploiting both the user's predicted ratings
and his textual profile. Exploiting both information sources
improves sentiment prediction performance compared
with a usual text-based sentiment classifier.</p>
      <p>
        Polarity classification is the task of predicting whether a
text dui (here, of a review) is positive or negative. We use the
ratings rui as ground truth and follow a standard
thresholding procedure [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]: reviews rated 1 or 2 are considered
negative, while items rated 4 or 5 are positive. All texts
rated 3 are ignored, as it is unclear whether they
are positive or negative: it strongly depends on the rating
habits of the user.
      </p>
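      <p>The thresholding rule above can be sketched directly (labels 0/1 and the None sentinel for the ignored rating are illustrative choices):</p>
      <preformat>
```python
def polarity_label(rating):
    """Standard thresholding of ratings into polarity labels:
    1-2 -> negative (0), 4-5 -> positive (1), 3 -> ambiguous,
    ignored (None)."""
    if rating <= 2:
        return 0
    if rating >= 4:
        return 1
    return None
```
      </preformat>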
      <p>For evaluation purposes, we consider two baselines. The first
one only uses the rating prediction of our recommender
system, f(u, i), as a label prediction; this value is then
thresholded as indicated above. The second one is a classical text
sentiment classifier. Denoting by dui the binary bag-of-words
representation of a document and cui the binary label
associated to the rating rui, it uses a linear SVM s(dui) = dui·w.
Note that this is usually a strong baseline for the polarity
classification task. Our final classifier combines f(u, i)
and s(dui) in a joint optimization problem.</p>
      <p>[Table 1: dataset characteristics — source, subset names, number of users and items.]</p>
    </sec>
    <sec id="sec-5b">
      <title>3. EXPERIMENTS</title>
      <p>All three modules (ratings, text, sentiments) are evaluated
independently, since there is no global evaluation framework.
Together, these individual performances should however provide
a quantitative appreciation of the whole system.</p>
      <p>
        We use two real world datasets of user reviews, collected
from amazon.com [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and ratebeer.com [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Their
characteristics are presented in table 1.
      </p>
      <p>Below, we first present how the datasets are preprocessed
(section 3.1). The benefits of incorporating text in the rating
prediction of the recommender system are then discussed in
section 3.2. The quality of the generated reviews is evaluated
and analyzed in section 3.3. Finally, the performance of the
sentiment classifier combining text and ratings is described
in section 3.4. Two vocabularies are used:
for the autoencoder, we have selected the 5000 most
frequent words after a stopword removal step; the
autoencoder input vector is then a binary vector of
dimension 5000;
for the raw representation, we have selected the 100 000
most frequent words appearing in more than 10
documents (including stopwords) and used a binary vector
representation.</p>
      <p>For the experiments, we consider several subsets of the
databases with different numbers of users and items. Each
dataset is built by extracting, for a given number of users
and items, the most active users and the most commented
items. Dataset characteristics are given in table 1.</p>
    </sec>
    <sec id="sec-6">
      <title>3.2 Recommender system evaluation</title>
      <p>Let us first consider the evaluation of the rating
prediction. The metric used here is the mean squared error (MSE)
between rating predictions r̂ui and actual ratings rui. The
lower the MSE, the better the model estimates
the correspondence between user tastes and items. Results
are presented in table 2.</p>
      <p>The models are referenced using the notations introduced
in section 2.2. The first column corresponds to a trivial
system which predicts the overall bias μ; the second predicts
the user bias βu. Both give poor performance, as expected.</p>
      <p>The third column corresponds to the item bias βi
baseline. It assumes that user taste is not relevant and that each
item has its own intrinsic quality. The improvement with
respect to μ and βu is important, since the MSE is halved. The
fourth column corresponds to a nonnegative matrix
factorization baseline, denoted γu·γi. It jointly computes latent
representations for user tastes and item characteristics.
Unsurprisingly, it is our best baseline.</p>
      <p>It can be noted that performance tends to degrade when
the subset size increases. This is a side effect of
the review selection process used for building the different
datasets: smaller datasets contain the most active users and
the most commented items, and the estimation of their profiles
benefits from the high number of reviews per user (and item)
in this context.</p>
      <p>The last two columns refer to our hybrid recommender
systems, using the two text representations introduced in
section 2.2. Both fA (autoencoder) and fT (raw text)
perform better than the baseline collaborative filtering system,
and both have similar approximation errors. The main
difference between the systems comes from the complexity of
the approach: during the learning step, fT is much faster
than fA, given that no autoencoder optimization is
required. On top of that, fT remains faster at inference:
the inherent sparsity of the bag-of-words representation
enables fT to provide faster computations than fA. The
autoencoder works in a smaller-dimensional space, but it is not
sparse.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 Text generation evaluation</title>
      <p>We now move on to the evaluation of the personalized
review text generation module. Since we are using an
extractive summary procedure, we make use of a classical
measure for summarization systems: the recall-oriented
ROUGE-n metric, comparing the generated text against
the actual text of the review produced by the user. As far as
we know, generating candidate reviews has never been dealt
with in this context, and this is a novel task. The ROUGE-n
metric is the proportion of n-grams of the actual text found
in the predicted (candidate) text; we use n ∈ {1, 2, 3}. The
higher the ROUGE-n, the better the quality of the candidate
text. A good ROUGE-1 means that topics and
vocabulary are correctly captured, while ROUGE-2 and ROUGE-3
are more representative of the user's style.</p>
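      <p>A simplified sketch of this recall-oriented measure (set-based rather than count-based, so repeated n-grams are counted once; standard ROUGE implementations use n-gram counts):</p>
      <preformat>
```python
def rouge_n(reference, candidate, n):
    """Recall-oriented ROUGE-n sketch: proportion of the reference
    text's distinct n-grams that also appear in the candidate text."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    ref = ngrams(reference.split(), n)
    if not ref:
        return 0.0
    cand = ngrams(candidate.split(), n)
    return len(ref & cand) / len(ref)
```
      </preformat>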
      <p>A first baseline is given by using a random scoring function
h (instead of the formulation given in (11)); it provides a
lower bound on the performance. Three oracles are then
used to provide an upper bound on the performance: they
directly optimize the ROUGE-n metrics from the data on
the test set. A matrix factorization baseline is also used. It is
a special case of our model where no text information is used.
This model computes the same score for all the sentences of
a given user relative to an item. When only one sentence
is selected, it is taken at random among the sentences
of this user for the item. With greedy selection, the first
sentence is chosen at random, and then the cosine diversity
term (algorithm 1) allows a ranking of the next candidate
sentences. Our proposed method is evaluated with the two
different user profile representations θu (autoencoder and
raw text). The performance of these seven models on
the two biggest datasets with respect to the three metrics
is aggregated in figure 2.</p>
      <p>Each histogram corresponds to a text selection entity (whole
review text, best single sentence, greedy sentence selection).
Groups in the histograms (respectively, row blocks of the
tables) are composed of three cells corresponding respectively
to the ROUGE-1, -2 and -3 metrics. Not surprisingly, the results
for the single sentence selection procedure (1S) are always
worse than for the other two (CT: complete review, and XS:
multiple sentences). This is simply because a sentence
contains fewer words than a full review and can hardly share
more n-grams with the reference text than the full text does. For
the ratebeer.com datasets, selecting a set of sentences clearly
offers better performance than selecting a whole review in
all cases. Texts written to describe beers also describe the
tasting experience. Was it in a bar or at home? Was it a
bottle or on tap? Texts of the community share the same
structure and vocabulary to describe both the tasting and
the flavors of the beer. Most users write short and precise
sentences. This is an appropriate context for our sentence
scoring model, where the habits of users are captured by our
recommender systems. The performance slightly decreases
when the size of the dataset is increased. As before, this is
in accordance with the selection procedure of these datasets,
which focuses first on the most active users and most commented
items. For Amazon, the conclusion is not so clear, and
depending on the conditions, either whole reviews or selected
sentences get the best score. This is linked to the higher variety
in the community of users on the website: well-structured
sentences like those present in RateBeer are here mixed
with different levels of English and troll reviews.</p>
      <p>Overall, the different models follow a clear
hierarchy. First, stating the obvious, the random model has the
worst performance. Then, using a recommender system to
select relevant sentences helps in terms of ROUGE-n
performance. Using the text information brings, most of the
time, only a small score improvement. Overall, our models
offer only small improvements here with respect to random
or NMF text selection (i.e., selection based on rating similarity only).
After analyzing this behavior, we believe that it is due to
the shortness of the text reviews, to their relatively
standardized form (arguments are very similar from one review
to another), to the peaked vocabulary distribution of the
reviews, and to the nature of ROUGE. The latter is a classical
recall-oriented summarization evaluation measure, but it does
not distinguish well between text candidates in this
context. This also shows that there is room for improvement
on this aspect.</p>
      <p>Concerning the oracle, several conclusions can be drawn. For both single sentence and complete text selection, the gap between the ROUGE oracles and the proposed selection method is large, suggesting that there is still room for improvement here too. For the greedy sentence selection, the gap between the oracles and the hybrid recommender systems is moderate, suggesting that the procedure is fully efficient here. However this conclusion should be moderated. It can be observed that, whereas the ROUGE oracle is effectively an upper bound for single sentence or whole review selection, this is no longer the case for multiple sentence selection. Because of the complexity of selecting the best subset of sentences according to a loss criterion (which amounts to a combinatorial selection problem), we have been using a suboptimal forward selection procedure: we first select the best ROUGE sentence, then the second best, etc. In this case the ROUGE procedure is no longer optimal.</p>
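      <p>The suboptimal forward selection procedure described above can be sketched as a greedy loop (hypothetical function names; the scorer can be any ROUGE-style recall measure):</p>

```python
from collections import Counter

def flat_rouge1(sents, ref):
    """Unigram recall of a list of candidate sentences against a reference text."""
    toks = [t for s in sents for t in s.split()]
    c, r = Counter(toks), Counter(ref.split())
    return sum(min(n, r[g]) for g, n in c.items()) / max(len(ref.split()), 1)

def greedy_select(sentences, reference, score, k=3):
    """Forward selection: repeatedly add the sentence that most improves the
    score of the growing summary. Greedy, hence suboptimal: the best k-subset
    may differ from the sequence of locally best additions."""
    summary, remaining = [], list(sentences)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: score(summary + [s], reference))
        # Stop early once no sentence improves the summary any further.
        if score(summary + [best], reference) <= score(summary, reference):
            break
        summary.append(best)
        remaining.remove(best)
    return summary

sents = ["great hoppy taste", "served too warm", "lovely amber color"]
ref = "great taste and lovely amber color"
print(greedy_select(sents, ref, flat_rouge1, k=2))
```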
      <p>Concerning the measures, the performance decreases rapidly when we move from ROUGE-1 to ROUGE-2, 3. Given the problem formulation and the context of short product reviews, ROUGE-2, 3 are clearly too constraining and the corresponding scores are not significant.
3.4 Sentiment classification evaluation</p>
      <p>
        The performance of the different models, using the sentiment classification error as an evaluation metric, is presented in table 3. Because they give very poor performance, the bias recommendation models (μ and μu) are not presented here. The item bias μi, second column, gives a baseline, which is improved by the matrix factorization γu·γi, third column. Our hybrid models fA, fourth column, and fT, fifth column, have lower classification errors than all the other recommender systems. The first column, LL, is the linear support vector machine (SVM) baseline. It has been learnt on the training set texts, and the regularization hyperparameter has been selected using the validation set. Our implementation relies on liblinear (LL) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Its performance is better than that of the recommender systems, but it should be noted that it makes use of the actual text dui of the review, whereas the recommender systems only use past information regarding user u and item i. Note that even in this context, the recommender performance on RateBeer is very close to the SVM baseline.</p>
      <p>It is then possible to combine the two models, according to the formulation proposed in section 2.4. The resulting hybrid approaches, denoted LL + fA and LL + fT, exploit both the text-based decision (SVM) and the user profile (fA and fT). This combined model shows good classification performance and outperforms the LL baseline in 4 out of 7 experiments in table 3, while performing similarly to LL in the other 3 experiments.</p>
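      <p>The combination of the two models can be illustrated, in simplified form, as a late fusion of their decision scores (a sketch assuming a simple weighted vote; the exact formulation of section 2.4 may differ):</p>

```python
def combine(svm_score, rec_score, alpha=0.5):
    """Late fusion of two polarity scores in [-1, 1]:
    svm_score is computed from the review text, rec_score only from the
    (user, item) profiles. alpha weights the text-based decision and
    would be tuned on a validation set."""
    return alpha * svm_score + (1 - alpha) * rec_score

def predict_polarity(svm_score, rec_score, alpha=0.5):
    """Final binary sentiment decision of the fused model."""
    return 1 if combine(svm_score, rec_score, alpha) >= 0 else -1

# A confident recommender score can flip a borderline SVM decision:
print(predict_polarity(-0.05, 0.8))  # -> 1
```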
    </sec>
    <sec id="sec-8">
      <title>OVERALL GAINS</title>
      <p>In order to get a global vision of the overall gain provided by the proposed approach, we summarize here the results obtained on the different tasks. For each task, the gain with respect to the (task dependent) baseline is computed and averaged (per task) over all datasets. The metric depends on the task. Results are presented in figure 3.</p>
      <p>[Figure 3: Overall gains per task, averaged over datasets. (a) Recommender systems: gain in % w.r.t. the MSE of γu·γi (baseline: matrix factorization) for μ, μu, μi, fA and fT. (b) Summarizers: gain in % w.r.t. the random selection procedure on ROUGE-1, 2, 3 for γu·γi, fA and fT. (c) Opinion classifiers: gain in % w.r.t. the good-classification rate of LL (baseline: SVM) for μi, γu·γi, fA, fT, LL + fA and LL + fT.]</p>
      <p>For the mean squared error metric (figure 3a) the matrix factorization is used as baseline. The user bias μu heavily fails to generalize on two datasets. The item bias is closer to the baseline (−11.43%). Our hybrid models, which use texts to refine user and item profiles, bring a gain of 5.71% for fA and 5.63% for fT. This demonstrates the interest of including textual information in the recommender system. Autoencoder and raw text approaches offer similar gains, the latter approach being overall faster.</p>
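      <p>The models compared here can be summarized by their prediction functions: the biases μ, μu, μi and the latent profiles γu, γi, with the hybrid variants refining the profiles using text. A hypothetical sketch of the biased matrix factorization predictor and its MSE evaluation (training omitted, random parameters for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 3
mu = 3.5                                    # global mean rating
mu_u = rng.normal(0, 0.2, n_users)          # per-user bias offsets
mu_i = rng.normal(0, 0.2, n_items)          # per-item bias offsets
gamma_u = rng.normal(0, 0.1, (n_users, k))  # latent user profiles
gamma_i = rng.normal(0, 0.1, (n_items, k))  # latent item profiles

def predict(u, i):
    """Biased matrix factorization: mu + mu_u + mu_i + gamma_u . gamma_i.
    The hybrid models of the paper refine gamma_u / gamma_i using texts."""
    return mu + mu_u[u] + mu_i[i] + gamma_u[u] @ gamma_i[i]

def mse(pairs, ratings):
    """Mean squared error over (user, item) pairs, the metric of figure 3a."""
    return np.mean([(predict(u, i) - r) ** 2 for (u, i), r in zip(pairs, ratings)])

print(round(float(predict(0, 1)), 3))
```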
      <p>For the text generation, we take the random model as baseline and results are presented in figure 3b. The gain is computed for the three investigated frameworks (CT: review selection, 1S: one sentence selection, XS: multiple sentence selection) and per measure (ROUGE-1, 2, 3), and then averaged to one overall gain. ROUGE-n oracles clearly outperform other models, which seems intuitive. The different recommender systems have very close behaviors, with respective gains of 11.15% (matrix factorization), 11.89% (auto-encoder) and 11.83% (raw text). Here textual information helps but does not clearly dominate ratings, providing only a small improvement. Remember that although performance improvement with respect to baselines is desirable, the main novelty of the approach here is to propose a personalized summary generation together with the usual rating prediction.</p>
      <p>For the opinion classifier, presented in figure 3c, the baseline consists in a linear SVM. Basic recommender systems perform poorly with respect to the baseline (LL). Surprisingly, the item bias μi (−68.71%) performs slightly better than the matrix factorization γu·γi (−69.54%) in the context of sentiment classification (no neutral reviews and binary ratings). Using textual information increases the performance. The autoencoder based model fA (−57.17%) and the raw text approach fT (−58.31%) perform similarly. As discussed in 3.4, the linear SVM uses the text of the current review whereas the recommender systems do not. As a consequence, it is worth combining both predictions in order to exploit text and past profiles: the resulting models give respective gains of 4.72% (autoencoder) and 3.89% (raw text) w.r.t. the SVM.</p>
    </sec>
    <sec id="sec-9">
      <title>RELATED WORK</title>
      <p>Since the paper covers the topics of rating prediction, summarization and sentiment classification, we briefly present each of them.</p>
    </sec>
    <sec id="sec-10">
      <title>Recommender systems</title>
      <p>
        Three main families of recommendation algorithms have been developed [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: content-based, knowledge-based, and collaborative filtering. Given the focus of this work on consumer reviews, we considered collaborative filtering. For merchant websites the goal is to encourage users to buy new products and the problem is usually considered either as the prediction of a ranked list of relevant items for each user [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or as the completion of missing ratings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We have focused here on the latter approach for evaluation reasons, since we use data collected from third-party sources.
      </p>
      <p>
        Text summarization for consumer reviews
Early reference work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] on consumer reviews has focused on the global summarization of user reviews for each item. The motivation of this work was to extract the sentiments associated with a list of features from all the item review texts. The summarization took the form of a rating or of an appreciation of each feature. Here, contrary to this line of work, the focus is on personalized item summaries for a target user. Given the difficulty of producing a comprehensive synthetic summary, we have turned this problem into a sentence or text selection process.
      </p>
      <p>
        Evaluating summaries is challenging: how can one assess the quality of a summary when the ground truth is subjective? In our context, the review texts are available and we used them as the ground truth, with the classical ROUGE-n summary evaluation measures [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>Sentiment classification</title>
      <p>
        Different latent text representations have been proposed in this scope: [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposed a generative model to jointly represent topics and sentiments, and recently several works have considered matrix factorization or neural networks in an attempt to develop robust sentiment recognition systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] goes further and proposes to learn two types of representation: a vectorial model is learned for word representation together with a latent transformation model, which allows the representation of negations and quantifiers associated with an expression.
      </p>
      <p>
        We have investigated two kinds of representation for the texts: bags of words and a latent representation through the use of autoencoders as in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also uses a latent representation of the reviews, although in a probabilistic setting instead of the deterministic one used here.
      </p>
    </sec>
    <sec id="sec-12">
      <title>Hybrid approaches</title>
      <p>
        In the field of recommendation, a first hybrid model was proposed by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: it is based on the hand labeling of review sentences (topic and polarity) to identify relevant characteristics of the items. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] pushes the exploitation of texts further, using a joint latent representation for ratings and textual content with the objective of improving the rating accuracy. These two works are focused on rating prediction and do not consider delivering additional information to the user. Very recently, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] has considered adding an explanation component to a recommender system. For that, they propose to extract from the review texts some keywords that are supposed to explain why a user likes or dislikes an item. This is probably the work whose spirit is closest to ours, but they do not provide a quantitative evaluation.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] combined opinion mining and text summarization on product reviews with the goal of extracting their qualities and defects. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] proposed a system for delivering personalized answers to user queries on specific products. They built the user profiles relying on topic modeling, without any sentiment dimension. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed a personalized news recommendation algorithm evaluated on the Yahoo portal using user feedback, but it does not investigate rating or summarization issues. Overall, we propose in this article to go beyond a generic summary of item characteristics by generating for each user a personalized summary that is close to what they would have written about the item themselves.
      </p>
      <p>
        For a long time, sentiment classification has ignored the user dimension and has focused, for example, on the conception of "universal" sentiment classifiers able to deal with a large variety of topics [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Considering the user has become an issue only very recently. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], for example, exploited explicit relations in social graphs to improve opinion classifiers, but their work is focused on this aspect only. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] proposed to distinguish different rating behaviors and showed that modeling the review authors on a scale ranging from connoisseur to expert offers a significant gain for an opinion prediction task.
      </p>
      <p>In our work, we have investigated the benefits of incorporating the text of user reviews into recommender systems to improve their performance as sentiment classifiers. We have additionally proposed, as a secondary contribution, an original model mixing recommender systems and linear classification.</p>
    </sec>
    <sec id="sec-13">
      <title>CONCLUSION</title>
      <p>This article proposes an extended framework for the recommendation task. The general goal is to enrich classical recommender systems with several dimensions. As an example, we show how to generate personalized reviews for each recommendation using extracted summaries. This is our main contribution. We also show how ratings and texts can be used to produce efficient personalized sentiment classifiers for each recommendation. Depending on the application, other additional information could be brought to the user. Besides producing additional information for the user, the different information sources can benefit from one another. We thus show how to effectively make use of text review and rating information for building improved rating predictors and review summaries. As already mentioned, the sentiment classifiers also benefit from the two information sources. This part of the work demonstrates that multiple information sources can be useful for improving recommendation systems. This is particularly interesting since several such sources are now effectively available at many online sites. Several new applications could be developed along the lines suggested here. From a modeling point of view, more sophisticated approaches can be developed. We are currently working on a multitask framework where the representations used in the different components are more closely correlated than in the present model.</p>
      <p>Acknowledgements The authors would like to thank the AMMICO project (F1302017 Q - FUI AAP 13) for funding our research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>BC</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B</given-names>
            <surname>Pang</surname>
          </string-name>
          .
          <article-title>Personalized recommendation of user comments via factor models</article-title>
          .
          <source>EMNLP'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M</given-names>
            <surname>Amini</surname>
          </string-name>
          and
          <string-name>
            <given-names>N</given-names>
            <surname>Usunier</surname>
          </string-name>
          .
          <article-title>A contextual query expansion approach by term clustering for robust text summarization</article-title>
          .
          <source>DUC'07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R</given-names>
            <surname>Burke</surname>
          </string-name>
          .
          <article-title>Hybrid recommender systems: Survey and experiments</article-title>
          .
          <source>UMUAI'02</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R-E</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K-W</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C-J</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X-R</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C-J</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>JMLR'08</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G</given-names>
            <surname>Ganu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N</given-names>
            <surname>Elhadad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A</given-names>
            <surname>Marian</surname>
          </string-name>
          .
          <article-title>Beyond the Stars: Improving Rating Predictions using Review Text Content</article-title>
          .
          <source>WebDB'09</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X</given-names>
            <surname>Glorot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Bordes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Domain adaptation for large-scale sentiment classification: A deep learning approach</article-title>
          .
          <source>In ICML'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Minqing</given-names>
            <surname>Hu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Mining and summarizing customer reviews</article-title>
          .
          <source>KDD '04, page 168</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N</given-names>
            <surname>Jindal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Opinion spam and analysis</article-title>
          .
          <source>In WSDM</source>
          , pages
          <fpage>219</fpage>
          -
          <lpage>230</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert</given-names>
            <surname>Bell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Matrix factorization techniques for recommender systems</article-title>
          .
          <source>Computer</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Chin-Yew</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          .
          <source>In ACL Workshop: Text Summarization Branches Out</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Hidden factors and hidden topics: understanding rating dimensions with review text</article-title>
          .
          <source>RecSys'13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>JJ</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews</article-title>
          .
          <source>WWW'13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Matthew R</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jonathan L</given-names>
            <surname>Herlocker</surname>
          </string-name>
          .
          <article-title>A Collaborative Filtering Algorithm and Evaluation Metric That Accurately Model the User Experience</article-title>
          .
          <source>In SIGIR'04</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Wondra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H</given-names>
            <surname>Su</surname>
          </string-name>
          , and
          <string-name>
            <given-names>CX</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Topic sentiment mixture: modeling facets and opinions in weblogs</article-title>
          .
          <source>In WWW. ACM</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B</given-names>
            <surname>Pang</surname>
          </string-name>
          and
          <string-name>
            <given-names>L</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Opinion mining and sentiment analysis</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B</given-names>
            <surname>Huval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>CD</given-names>
            <surname>Manning</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Semantic compositionality through recursive matrix-vector spaces</article-title>
          .
          <source>In EMNLP'12. ACL</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B</given-names>
            <surname>Pang</surname>
          </string-name>
          .
          <article-title>To each his own: personalized content selection based on text comprehensibility</article-title>
          .
          <source>In ICWDM'12. ACM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>User-level sentiment analysis incorporating social networks</article-title>
          .
          <source>In KDD'11. ACM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Explicit factor models for explainable recommendation based on phrase-level sentiment analysis</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>