Hybrid algorithms for recommending new items in personal TV

Fabio Airoldi, Politecnico di Milano - DEI, P.zza Leonardo da Vinci, 32, Milano, Italy (fabio.airoldi@gmail.com)
Paolo Cremonesi, Politecnico di Milano - DEI, P.zza Leonardo da Vinci, 32, Milano, Italy (paolo.cremonesi@polimi.it)
Roberto Turrin, Moviri - R&D, Via Schiaffino, 11, Milano, Italy (roberto.turrin@moviri.com)

ABSTRACT
Recommending TV programs in the interactive TV domain is a difficult task since the catalog of available items is very dynamic, i.e., items are continuously added and removed. Although recommender systems based on collaborative filtering typically outperform content-based systems in terms of recommendation quality, they suffer from the new-item problem, i.e., they are not able to recommend items that have few or no ratings. On the contrary, content-based recommender systems are able to recommend both old and new items, but the general quality of their recommendations, in terms of relevance to the users, is low.
In this article we present two different approaches for building hybrid collaborative+content recommender systems whose purpose is to produce relevant recommendations while overcoming the new-item issue. The approaches are tested on the implicit ratings collected from 15,000 IPTV users over a period of six months.

1. INTRODUCTION
Interactive television allows providers to deliver to their customers a huge amount of digital content. Since discovering interesting items can be difficult in such wide collections, recommender systems are used to provide the users with personalized lists of items that they may like.
Recommendations are based on the user preferences, referred to as ratings, that the system has gathered either explicitly (typically on a 1...5 scale) or implicitly (typically in binary format: 1 if the user likes an item, 0 otherwise). In the interactive TV (iTV) domain ratings are often implicit [2, 13, 12], inferred by tracking the users' activity (e.g., the purchased movies or the watched TV programs).
Item catalogs in the settings of iTV are intrinsically subject to frequent modifications: new programs are usually added as soon as they are available and old content becomes no longer available. We can identify:

(i) A set of items that are repeated over time, such as television seasons, weekly talk shows, or reruns of movies. We can assume that a number of users have watched these items, thus providing implicit ratings.

(ii) A set of items that are shown for the first time, such as the first showing of a movie. We can assume that no ratings have been collected about these items.

Most recommender systems are based on collaborative filtering algorithms, i.e., they recommend items on the basis of the preferences of users similar to the target user. However, since no ratings have been collected for new programs, such programs cannot be recommended by collaborative recommender systems. This issue is known as the new-item problem.
As an alternative to collaborative filtering, recommender systems can implement content-based filtering algorithms. Since content-based approaches base their predictions upon the description of TV programs in terms of features - such as genre and actors - they are not influenced by the lack of ratings. Unfortunately, collaborative algorithms have been systematically proven to outperform content-based algorithms in terms of recommendation quality, measured by standard accuracy metrics (e.g., recall and precision) and error metrics (e.g., RMSE and MAE). As an example, if we consider the state-of-the-art recommender algorithms implemented in [2], in our experiments the collaborative approach reaches a recall equal to 19%, while the content-based approach does not go over 3%.
In the settings of iTV, we would like to build a recommender system not affected by the new-item problem, but with a quality comparable to collaborative filtering. Several hybrid algorithms merging content-based and collaborative filtering into a unique algorithm have been proposed in the literature. Some of them have even been proven to outperform base collaborative recommender systems in terms of quality. However, the proposed solutions are rarely used in production environments, mainly because of scalability issues. Furthermore, most approaches are designed to work in the case of explicit, non-binary ratings, without any particular focus on the new-item problem.
In this work we present two families of hybrid collaborative+content algorithms which are specifically designed to respect the requirements of commercial real-time recommender systems. We take into account state-of-the-art collaborative and content-based recommender algorithms typically used in the presence of implicit, binary ratings. Each hybrid solution is composed of one collaborative and one content-based algorithm.
The main idea behind the first family of hybrid algorithms is to augment the existing ratings used for training the collaborative algorithm with additional ratings estimated with the content-based algorithm. Conversely, the second family of hybrid algorithms merges together the item-to-item similarities computed by the collaborative and the content-based algorithms.
As a comparison, we also implemented a state-of-the-art and a trivial hybrid algorithm. The latter simply mixes in a unique recommendation list the items recommended with the collaborative algorithm and those recommended with the content-based algorithm.
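The trivial mixing of two recommendation lists just mentioned can be sketched in a few lines. The following illustrative Python is ours, not code from the paper; the deduplication policy is an assumption (the lists are assumed to have equal length):

```python
def interleave(cf_list, cbf_list, n=10):
    """Mix two ranked recommendation lists by alternating items,
    skipping duplicates, and keeping the first n results."""
    merged, seen = [], set()
    for cf_item, cbf_item in zip(cf_list, cbf_list):
        for item in (cf_item, cbf_item):
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged[:n]

# An item recommended by both algorithms appears only once.
mixed = interleave(["a", "b", "c"], ["x", "a", "y"], n=4)
```

A production system would additionally handle lists of different lengths and items already seen by the user.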
The recommendation quality of the hybrid algorithms has been evaluated in terms of recall against the quality of baseline algorithms on a dataset implicitly collected by an IPTV provider over a period of six months. A specific testing methodology has been designed in order to evaluate the quality of recommender algorithms in the presence of new items.
In Section 2 we present an overview of the existing work on hybrid recommender systems, with particular focus on the algorithms that are most promising in the dynamic iTV scenario. In Section 3 we describe the reference state-of-the-art algorithms we included in our experiments. In Section 4 we illustrate the new algorithms that we developed. Section 5 details the dataset we used for the evaluation and the testing methodologies we adopted. Section 6 shows and discusses the results we obtained. Finally, Section 7 draws the conclusions and outlines some possible future work.

2. RELATED WORK
Recently, several television operators have considered the integration of a recommender system into their architectures in order to provide personalized content (e.g., [2]).
The need for recommender systems in TV applications is motivated by the fact that users generally appreciate receiving personal suggestions generated according to their dynamically updated profiles, as shown in [26]. The same study also reported that customers prefer to be recommended items similar to the ones that they have already rated, but also items that their friends have enjoyed.
Recommender systems used to deliver personalized TV experiences are typically content-based [36], collaborative [35] or hybrid solutions [34]. In the next paragraphs we summarize the limitations of non-hybrid recommender systems and we present an overview of the most interesting existing algorithms suitable for personalized TV purposes.

2.1 Drawbacks of standard algorithms
The two main families of recommender algorithms are content-based and collaborative filtering. The former are based on the analysis of the content of items (e.g., genre and actors). On the contrary, the latter suggest items on the basis of the preferences of users similar to the target user.
Collaborative algorithms have recently received much more attention than content-based solutions. The main reason is that they usually reach a higher recommendation quality than content-based systems [17, 8]. Furthermore, while collaborative algorithms can be easily implemented in any domain, content-based systems are much more complex because they require analyzing the items' content (e.g., parsing textual data) [23].
However, since collaborative algorithms are based on the user ratings, they have the following major drawbacks [1]:

- New-item. Collaborative filtering is particularly affected by the new-item problem, being not able to recommend items that have received few or no ratings, because the system does not have enough information.

- Popularity bias. Collaborative filtering is biased towards the most popular items, i.e., items which have been rated by many users are more likely to be recommended than items that have few ratings.

2.2 Hybrid algorithms
During the last years several approaches have tried to overcome the drawbacks of single recommender approaches by combining them into new hybrid recommender algorithms, from simple implementations (e.g., [34, 25, 10, 6]) up to very complex algorithms, such as the BellKor solution winning the Netflix prize [3], which combines predictions from 107 different baseline recommender systems [4]. The idea of merging multiple predictors has often been used in the settings of machine learning to improve classification accuracy (e.g., [21]).
Burke performed an extensive survey and classification of hybrid recommender systems [6, 7].
Mixed algorithms present simultaneously the items recommended by two different recommender algorithms. As an example, Smyth and Cotter [34] show in the same interface recommendations generated by a content-based and a collaborative algorithm.
In case the confidence of a prediction is measurable, it is possible to implement a switched algorithm, which selects the best algorithm to use according to some confidence estimates. For instance, the system 'NewsDude' [5] combines a content-based nearest-neighbor recommender, a collaborative recommender, and a naïve Bayes classifier. Ratings are predicted by the algorithm which shows the highest confidence, and then the user is recommended the items that have the highest predictions.
Weighted algorithms compute a linear combination of the ratings predicted by two (or more) recommender algorithms. A similar approach has been proposed by Mobasher et al. [25], which linearly combines item-to-item similarities generated by different recommender algorithms. Differently from other hybrid solutions, this method can be used on implicit datasets (see Section 3.3 for further details).
Meta-level recommender systems use the model produced by an auxiliary algorithm for training the primary algorithm. As an example, the restaurant recommender proposed by Pazzani in [28] builds a feature-based model of the users using content-based information. This model is then used by a collaborative algorithm to compute recommendations.
Melville et al. propose a feature-augmentation algorithm denoted "Content boosted collaborative filtering" (CBCF) [24], which Burke's survey [6, 7] reports as one of the best algorithms. Their approach basically consists in creating 'augmented user profiles' by adding 'pseudo-ratings' to the original user profiles prior to generating recommendations. Pseudo-ratings are generated using a content-based naïve Bayes classifier and can be interpreted as the ratings that users would give to unrated items, given the items' features. Rating prediction is computed with a variant of the user-based collaborative approach, where user-to-user similarities are computed as the Pearson correlation coefficient between the original user profiles and the augmented user profiles, and the weights assigned to pseudo-ratings depend on the number of rated and co-rated items for each user. Melville et al. reported that their algorithm performed better than the content-based naïve Bayes and the user-based collaborative algorithms in terms of MAE (Mean Absolute Error). They also showed that their algorithm is less susceptible to problems induced by sparsity: at 99.9% sparsity their algorithm performed exactly like the content-based baseline.
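As a rough illustration of the pseudo-rating idea, an augmented rating matrix can be obtained by filling unrated cells with content-based predictions. The sketch below is ours and deliberately simplified: it thresholds precomputed content-based scores instead of using Melville et al.'s naïve Bayes classifier and per-user weights.

```python
import numpy as np

def augment_profiles(urm, content_scores, threshold=0.5):
    """Fill unrated cells (0) of the user-by-item matrix with
    content-based pseudo-ratings above a threshold; original
    ratings are kept untouched. The threshold is our own
    illustrative assumption."""
    pseudo = np.where((urm == 0) & (content_scores >= threshold),
                      content_scores, 0.0)
    return urm + pseudo

urm = np.array([[1.0, 0.0],
                [0.0, 1.0]])
cb_scores = np.array([[0.9, 0.8],
                      [0.2, 0.9]])
aurm = augment_profiles(urm, cb_scores)
```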
Although several hybrid algorithms have been brought to the scientific attention during the last few years, there is no exhaustive literature on hybrid recommender algorithms studied to alleviate the new-item problem, with the only notable exception being the work of Schein et al. [33, 32]. They describe a generative algorithm with the specific purpose of performing well in cold-start situations. In their work they present a multi-aspect probabilistic model which is used to compute the probability that an item is liked or not by a user. The aspects include collaborative data as well as content-based data and one or more latent aspects. The hidden aspects are used to model the hypothesis that, as an example, a user liked a specific movie because of some particular, latent motivation. These kinds of algorithms, along with association rule mining, neural networks and Boltzmann machines, even if promising, are however difficult to integrate in existing systems, as they use a completely different approach from traditional collaborative filtering and do not properly scale with the number of users and items.
Also, hybrid algorithms are subject to a series of problems. First, many of them (e.g., weighted hybrids) need to use underlying algorithms that are able to perform rating predictions. This makes them useless on implicit domains like iTV applications: algorithms which rely on predicting ratings strictly need explicit ratings to work properly, as they are usually optimized with respect to error metrics (such as MAE or RMSE - e.g., [22]). The presence of only implicit, binary ratings (as in our datasets) does not allow the use of this set of algorithms, since we do not have proper ratings.
Additionally, some of the algorithms have scalability issues. As an example, CBCF [24] requires augmented user profiles in order to produce recommendations. Since it is impossible to store the augmented URM (it would be a full matrix), the profiles need to be generated in real time. However, this requires computing the pseudo-ratings for all the users in the neighborhood using a naïve Bayes classifier. Using the neighborhood size suggested by Melville et al., this would mean generating 30 recommendation lists with the Bayesian classifier just to produce one hybrid recommendation list.
Other, simpler hybrid algorithms belonging to the weighted, switching, and mixed categories need to generate in real time a number of recommendation lists equal to the number of underlying baseline recommenders. The consequence is that even the simplest algorithm belonging to those categories (e.g., the interleaved approach described in Section 3.3) would have a halved throughput with respect to a non-hybrid solution.
The only state-of-the-art hybrid approach with a complexity comparable to non-hybrid algorithms is SimComb [25]; however, we did not find any reference to this solution applied to implicit datasets.

3. STATE-OF-THE-ART ALGORITHMS
Recommender algorithms usually collect user ratings in a user-by-item matrix, from here on referred to as User Rating Matrix (URM) and denoted by R, where r_ui is the rating given by user u to item i. The u-th row of this matrix is denoted by r_u and represents the profile of user u.
The URM is typically very sparse (users rate, on average, only a few items). In the settings of iTV we assume to have only implicit feedback, so that the URM contains binary ratings, where the values 1 and 0 indicate, respectively, whether the TV program has been watched or not.
Collaborative filtering algorithms analyze the URM to identify similarities between either users (user-based) or items (item-based), or hidden relations between users and items (latent factors). Item-based collaborative algorithms have been proved to outperform user-based approaches in terms of scalability and recommendation quality [1]. They discover item-to-item similarities using metrics such as the cosine or the adjusted cosine similarity [30]. Rating prediction for a specific item is computed by considering the ratings given by the target user to items similar to it (denoted neighbors). Finally, latent-factor collaborative algorithms are typically based on singular value decomposition (SVD) to extrapolate underlying relations between users and items [31, 22].
Content-based recommender systems base recommendations on the features extracted from the items' content, such as the actors, the genre, and the summary of a movie. The typical bag-of-words approach represents items as vectors of features stored in a feature-by-item matrix, which is referred to as Item Content Matrix (ICM) and denoted by W. Items' content requires to be processed in order to identify terms (tokenizing), to remove useless words (stop words), to normalize words (stemming), and to weight each feature (weighting). As for the latter, weights are usually computed using the TF-IDF metric [29]. Users can be represented as feature vectors, composed of the features of the movies they have watched. Rating prediction can be computed as the similarity between user and item feature vectors.
In the following sections we describe the state-of-the-art recommender algorithms we have taken into consideration. They comprise two collaborative filtering algorithms, one content-based algorithm, and two hybrid algorithms.
The algorithms we have considered are designed to be used with implicit, binary datasets. Furthermore, the algorithms share the property of not requiring any user-specific parameters to be learned in advance. As a consequence, these algorithms are able to serve any user in real time on the basis of his/her most recent ratings.
The state-of-the-art collaborative and content-based algorithms will be used in Section 4 for defining two families of hybrid recommender algorithms, each one combining one content-based and one collaborative filtering algorithm: filtered feature augmentation and similarity injection.

3.1 Collaborative
In the following we present one item-based (NNCosNgbr) and one latent-factor (PureSVD) state-of-the-art collaborative algorithm suitable for predicting top-N recommendations in the case of binary ratings.

Non-normalized cosine KNN (NNCosNgbr).
This is a state-of-the-art item-based collaborative filtering algorithm described in [20]. Note that in the case of binary ratings we cannot compute similarity metrics such as the Pearson coefficient and the adjusted cosine, thus the item-item similarity has been measured as the cosine similarity. Rating prediction for item i has been computed by summing up the similarities between item i and its neighbors, where the set of neighbors - denoted by D_k(u;i) - has been limited to the k most similar items. This approach is usually known as knn (k nearest neighbors). Rating prediction is computed as:

    r̂_ui = Σ_{j ∈ D_k(u;i)} r_uj · s^CF_ij    (1)

where s^CF_ij represents the cosine similarity between items i and j, computed as:

    s^CF_ij = (r_i · r_j⊺) / (‖r_i‖₂ · ‖r_j‖₂)    (2)

where x · y denotes the inner product between vectors x and y, and ‖x‖₂ denotes the Euclidean norm of vector x. This algorithm has been proved to have good quality, as shown in [11] and [15].
PureSVD.
PureSVD is a latent-factor collaborative algorithm, proved to grant high recommendation quality in terms of recall [11]. Let f denote the number of latent factors used for representing users and items. Typical values of f are in the range [50, 300]. SVD allows to factorize the URM into three f-dimensional matrices - U, Σ, and Q - and to compute an f-rank approximation of the original URM:

    R_f = U · Σ · Q⊺    (3)

where U and Q are orthonormal matrices and Σ is diagonal. User u and item i are represented by f-dimensional vectors, respectively: p_u and q_i. Rating prediction is computed using the inner product:

    r̂_ui = p_u · q_i⊺    (4)

If we define P = U · Σ, we can observe that, since U and Q are orthonormal, P = R · Q, and (4) can be rewritten as:

    r̂_ui = r_u · Q · q_i⊺    (5)

Finally, observe that the inner product q_i · q_j⊺ represents the similarity between items i and j.

3.2 Content-based
Differently from collaborative filtering, content-based recommender systems rely only on content features. In the following we describe LSA [2], a well-known approach based on SVD.

Latent Semantic Analysis (LSA).
Latent Semantic Analysis [16, 19, 2] uses SVD to reduce the dimensionality of the ICM and to capture underlying relations - like synonymy - between items. The weights in the ICM are computed using the TF-IDF metric [29]. Let l be the number of factors - typically referred to as latent size - to be used for representing items' features. The ICM can be factorized and approximated as:

    W_l = Z · Λ · Y⊺    (6)

Let us define B = Y · Λ. The similarity between items i and j is denoted by s^CBF_ij and can be computed as the cosine between vectors b_i and b_j:

    s^CBF_ij = (b_i · b_j⊺) / (‖b_i‖₂ · ‖b_j‖₂)    (7)

and rating prediction is computed as:

    r̂_ui = r_u · B · b_i⊺    (8)

3.3 Hybrid algorithms
In the following sections we present two families of hybrid recommender systems, each based on one collaborative filtering and one content-based filtering algorithm. The requirement for an algorithm to be used in the second solution is that it allows deriving a similarity measure between items. Thus, all state-of-the-art collaborative and content-based algorithms presented in Sections 3.1 and 3.2 can be used in the following hybrid approaches.

Interleaved.
This algorithm is a trivial hybrid implementation that forms the recommendation list to suggest to the target user by alternating, in turn, one item predicted by the collaborative algorithm and one predicted by the content-based algorithm.

SimComb.
Mobasher et al. [25] developed a weighted hybrid algorithm in which item-to-item similarity values are computed as the linear combination of content-based and collaborative similarities:

    c_ij = α · s^CBF_ij + (1 − α) · s^CF_ij    (9)

where s^CBF_ij and s^CF_ij are computed using (7) and (2), respectively. Finally, rating prediction can be computed using (1), where s^CF_ij is to be replaced with c_ij. We refer to this algorithm as simComb.
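The simComb blend of Eq. (9) reduces to one line; the sketch below is our own illustrative code, with α = 0.3 as used later in Section 6:

```python
import numpy as np

def sim_comb(s_cbf, s_cf, alpha=0.3):
    """Eq. (9): c_ij = alpha * s_cbf_ij + (1 - alpha) * s_cf_ij.
    The blended matrix then replaces s_cf in the kNN prediction
    of Eq. (1)."""
    return alpha * s_cbf + (1.0 - alpha) * s_cf

s_cbf = np.array([[0.0, 1.0], [1.0, 0.0]])
s_cf = np.array([[0.0, 0.5], [0.5, 0.0]])
C = sim_comb(s_cbf, s_cf)   # C[0, 1] = 0.3*1.0 + 0.7*0.5 = 0.65
```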
4. PROPOSED ALGORITHMS
The algorithms we present in the following sections are designed to overcome some of the limitations of existing algorithms. Our first priority was to design hybrid recommender systems able to work on implicit, binary ratings and to grant good recommendation quality even when only content information is available (new items). In addition, we also focused on scalability, in order to design algorithms that can be used in real iTV scenarios.
In the next two sections we present two families of hybrid solutions: Filtered Feature Augmentation and Similarity Injection Knn.

4.1 Filtered Feature Augmentation (FFA)
Filtered Feature Augmentation is a feature augmentation algorithm mainly inspired by CBCF (Content Boosted Collaborative Filtering) [24]. Differently from Melville's solution, our approach:

1. does not need any user-specific parameter to be learned;

2. allows the use of different content-based and collaborative filtering algorithms;

3. computes item-item similarities on the basis of the ratings augmented with pseudo-ratings derived from the content-based filtering, but uses the original user profiles for predicting ratings (as opposed to computing item-item similarities using the original URM and augmenting user profiles for predicting ratings).

Figure 1 shows the learning process: the CBF stage trains the content-based algorithm and computes the pseudo-ratings for unknown rating values to be added to the original URM. The augmented URM (namely, aURM) is used as input for the collaborative filtering.

Figure 1: Model generation for FFA. The CBF stage comprises content-based model generation and recommendations for every item.

Since adding all pseudo-ratings would lead to a dense, very large augmented URM, the filter selects only the most relevant pseudo-ratings. In our experiments we used two different filters: a simple one which excludes all the pseudo-ratings lower than a fixed threshold (FFAt), and a more sophisticated one which uses the Gini impurity measure [14] in order to add both high and low pseudo-ratings, thus increasing the intrinsic information of the item profiles (FFAg). The Gini coefficient is defined as:

    Gini(v) = 1 − Σ_{x ∈ v} p_x²    (10)

where p_x is the probability of x in v. In our specific case v is an item profile (i.e., a column of the URM) and p_x = n_x / n, where n_x is the number of ratings equal to x in v and n is the number of ratings in v. When Gini(v) = 0, v is pure (and brings almost no information). As we want to add informative pseudo-ratings, the filter lets through only the pseudo-ratings that increase the Gini index of each item the most. This is done until at least g pseudo-ratings are added to each item profile. The value of g depends on the number of original ratings in each profile (denoted by n):

    g = n_min − n + (h · n)    if n ≤ n_min
    g = (h · n_min²) / n       otherwise    (11)

where n_min and h are parameters. In our experiments we used the average number of ratings as the value for n_min and 0.3 for h.
Rating prediction has been computed by using (1), where r_ui are the original user ratings and s^CF_ij is the similarity between items i and j computed using (2) on the augmented URM.

4.2 Similarity Injection Knn (simInjKnn)
Similarity Injection Knn builds a model using item-item similarities obtained by one collaborative and one content-based algorithm.
We first compute item-item similarities s^CF_ij using collaborative filtering, retaining only the k most similar items. Similarly, we compute item-item similarities s^CBF_ij using content-based filtering. The similarities are later merged into a unique item-item similarity matrix S, by adopting a two-step process:

- all the elements s^CF_ij are copied into S. Thus, each item is given k similarity values deriving from the collaborative filtering;

- the elements s^CBF_ij are later inserted into the corresponding empty (i.e., zero) elements of S, in such a way to have, for each item, a set of k additional similarity measures deriving from the content-based filtering.

At the end of the construction, each item will have a total of 2k similar items:

- the k most collaborative-based similar items;

- the k most content-based similar items not appearing in the previous set.

Exactly half the similarities come from the content-based technique and half from the collaborative technique.
The framework allows different combinations of item-based collaborative and content-based filtering. In this work we present the results based on LSA as the content-based algorithm and NNCosNgbr as the collaborative filtering, since they reached the highest quality in terms of recall. Thus, s^CF_ij and s^CBF_ij have been computed using (2) and (7), respectively. Finally, rating prediction is estimated with (1), where s^CF_ij is replaced by s_ij.
5. DATASET AND EVALUATION METHODOLOGY
We evaluated our algorithms using a variant of the methodology described in [22, 11], designed to evaluate the performance of recommender algorithms with a focus on the new-item problem.
We used the dataset provided by an IPTV provider (the ratings refer to the dataset TV2, available at http://home.dei.polimi.it/cremones/memo/). It contains 25765 binary ratings implicitly collected over a period of six months, given by 15563 users on 794 items (0.0021 sparsity). Content-based features comprise: actors, directors, category (e.g., movie, series, documentary...), title, genres, summary, and language, for a total of 11291 features and an average of 18 features per item. The main characteristics of the available features are summarized in Table 1.

Table 1: Dataset characteristics: number of different values for each type of feature (a) and number of items for each language (b).

(a) Types of features        (b) Languages
Type        #values          Language   #items
actors      693              German     703
categories  8                French     84
directors   88               Italian    7
genres      16
languages   3

The evaluation of recommender systems is typically performed by using either error metrics or accuracy metrics [18]. Since error metrics - such as MAE (mean absolute error) and RMSE (root mean square error) - rely on computing the error between actual and predicted ratings, they cannot be measured on implicit, binary datasets, where this information is not available [20, 15].
For such reason we focus on accuracy metrics, which estimate the fraction of relevant items that are actually recommended (recall) or the fraction of the recommended items that are actually relevant (precision). In addition, recent works [11, 27] consider accuracy metrics more suitable than error metrics for evaluating the top-N recommendation task [18], i.e., the capability of the recommender system to suggest very limited lists of items that are likely to be of interest for the users.
The standard definition of recall - which is typically used in the settings of information retrieval - is:

    recall = |relevant ∧ retrieved| / |relevant|    (12)

Usually, in the settings of recommender systems, the set of relevant items is composed of the items positively rated (e.g., if the rating is 5 out of 5). However, since we are dealing with an implicit, binary dataset, in our evaluation we have considered - analogously to other works such as [15, 12, 13] - all rated items to be relevant, as we do not have any further information about the degree of user satisfaction.
The original approach evaluates the quality of recommender algorithms by measuring the recall as a function of the number of items displayed to the user (N). The test consists in excluding a certain amount of ratings (test set) and using the remaining ratings to train the algorithms (training set). All available content-based features are used for training the content-based and hybrid algorithms. Once an algorithm has been trained with the ratings in the training set, each rating r_ui in the test set is tested as follows:

- we predict the score for all items unrated by user u;

- we select the top-N items according to the estimated score;

- if item i appears in the top-N recommendation list, we have a 'hit', i.e., the system has correctly suggested a relevant item.

With respect to (12), the set of relevant items corresponds to the test set, while the set of relevant, retrieved items corresponds to the number of hits. Thus, recall can be rewritten as a function of N, i.e., the number of items displayed to users:

    recall(N) = #hits / |test set|    (13)

Because of the high dataset sparsity, we have formed the test set by randomly selecting 20% of the ratings, in order to have a significant number of samples.
This evaluation methodology is not able to measure the quality of the recommendations on the new-item problem. Therefore we have implemented some modifications.

5.1 Performance on new items
In order to specifically evaluate the impact of new items on the quality of the different algorithms, we developed a testing methodology, which we refer to as sliding window, an extension of the evaluation methodology presented in [22, 11].
Let M denote the number of items in the dataset. We randomly divide the items in two sets, H1 and H2, each one composed of M/2 items. Test set and training set are defined as a function of the percentage parameter β as follows:

- we define a set T by randomly selecting 20% of the ratings in the URM;

- the training set is defined as the set of ratings related to items in H1, excluding all ratings in T;

- we form a set H1+2 composed of M/2 items, (100 − β)% randomly extracted from set H1 and β% randomly extracted from set H2;

- the test set is composed of the ratings in T related to items in H1+2.

Figure 2 schematically shows how the different sets are formed.

Figure 2: Sliding window. Blank circles represent ratings in the set T, solid circles represent ratings in the test set, and x's represent ratings in the training set.

For each value of β we have composed a training and a test set and we have computed the quality of the recommender algorithm in terms of recall, as defined in (13), where N has been set equal to 20. The test is the same as the one described for the original evaluation methodology. For each rating r_ui in the test set: (i) we predict the score for all items unrated by user u, (ii) we select the top-20 items according to the estimated score, and (iii) we verify whether there has been a hit, i.e., if item i appears in the top-20.
Note that only ratings related to half the items are available for training the algorithms. The parameter β specifies the percentage of new items, i.e., the percentage of test items that do not have ratings in the training set and so cannot be recommended by standard collaborative filtering. In our experiments we varied β from 0% to 100%.
Collaborative filtering is expected to be able to recommend only items in H1, so the higher β, the lower the quality of the algorithm. When β = 100% we expect the quality of any collaborative filtering to be 0. On the other hand, content-based algorithms are trained exclusively with content-based features, thus resulting totally independent from the ratings included in the training set: we expect their quality not to be influenced by β. Finally, hybrid approaches can be trained with the ratings related to half the items and with all available content-based features.
Finally, hybrid approaches can be trained with the ratings related to half the items and with all available content-based features.

6. RESULTS AND DISCUSSION
In our tests we considered the state-of-the-art recommender algorithms described in Section 3 and the hybrid approaches proposed in Section 4. As for the former, we included two item-based collaborative algorithms – PureSVD and non-normalized cosine (NNCosNgbr) – a content-based algorithm – Latent Semantic Analysis (LSA) – and two hybrid algorithms – interleaved and simComb. As for the latter, we included simInjKnn and two variants of filtered feature augmentation, referred to as FFAg and FFAt. FFAg uses a filter based on Gini's impurity measure, while FFAt uses a filter based on a fixed threshold, set to 1, thus excluding all pseudo-ratings lower than 1 (the number of pseudo-ratings added to the URM was 27.5% of the original number of ratings).

The latent size l of LSA has been set to 300, while the number of features f in PureSVD is equal to 50. The neighborhood size k has been set to 200 for NNCosNgbr. As for simComb, the coefficient α has been set to 0.3, as it empirically provides the best results. Finally, the neighborhood size k used for simInjKnn is 100.

Figure 3 shows the sliding window results, plotting the recall of the state-of-the-art (a) and the proposed hybrid (b) algorithms as a function of β, i.e., the percentage of new items. The recall has been computed assuming N = 20.

Figure 3: Results of the sliding window test. Figure (a) refers to the state-of-the-art approaches (interleaved, LSA300, NNCosNgbr, pureSVD, simComb), while Figure (b) to the proposed hybrid solutions (FFAg, FFAt, simInjKnn); both plot Recall(20) against the percentage of new items (β).

The most perceptible result is the very poor quality of the content-based algorithm, whose recall does not exceed 3%, much lower than the best collaborative approach - NNCosNgbr - whose recall is about 25% when recommending only old items. In addition, we can observe that three hybrid algorithms - simComb, FFAt, and simInjKnn - have a recall higher than the collaborative and content-based state-of-the-art solutions. As for the other two hybrid algorithms, the interleaved approach confirms its poor quality, being based on a very trivial technique, while the lower quality of FFAg is explained by the fact that binary ratings do not carry enough information for the filter to work properly. However, the results of the sliding window test show that FFAt and simInjKnn outperform the state-of-the-art algorithm simComb when more than 40% of the items are new. In addition, we can notice that recall drops drastically when β = 100%, with the exception of the content-based algorithm, which is not influenced by the presence of unrated items.

As a final observation, let us consider the performance implications of the proposed hybrid solutions. Similarity Injection Knn (simInjKnn) requires building two item-to-item similarity matrices, which is the same effort required by the interleaved algorithm. Filtered Feature Augmentation (FFAg and FFAt), on the other hand, requires computing pseudo-ratings for each item; depending on the number of items and users, this can be computationally costly. The main benefit of Filtered Feature Augmentation is that it is completely independent from the collaborative and content-based algorithms used, as opposed to Similarity Injection Knn, which strictly requires algorithms able to express item-item similarities.

7. CONCLUSIONS AND FUTURE WORKS
The new hybrid algorithms we proposed have been proved to improve the overall performance of the state-of-the-art algorithms in suggesting new items, though their performance is limited by the poor quality of content-based algorithms.

In future work we plan to implement between-subjects controlled experiments for a subjective evaluation of the proposed solutions. In fact, recent works have pointed out that the objective evaluation of recommender systems (i.e., based on error and accuracy metrics) is not always aligned with the actual user perception of recommendation quality, as measured via controlled experiments (e.g., [9, 27]).

8. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749, 2005.
[2] R. Bambini, P. Cremonesi, and R. Turrin. A recommender system for an IPTV service provider: a real large-scale production environment. In Recommender Systems Handbook. Springer, 2010. To appear.
[3] R. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize. 2007.
[4] R. M. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 43–52, Oct. 2007.
[5] D. Billsus and M. J. Pazzani. User modeling for adaptive news access. User Modeling and User-Adapted Interaction, 10:147–180, February 2000.
[6] R. Burke. Hybrid recommender systems: survey and experiments. User Modeling and User-Adapted Interaction, 12:331–370, November 2002. Kluwer Academic Publishers, Hingham, MA, USA.
[7] R. Burke. Hybrid web recommender systems. In The Adaptive Web, pages 377–408. Springer-Verlag, Berlin, Heidelberg, 2007.
[8] L. Candillier, K. Jack, F. Fessant, and F. Meyer. State-of-the-art recommender systems. IGI Global, pages 1–22, 2009.
[9] L. Chen and P. Pu. A user-centric evaluation framework of recommender systems. In ACM Conference on Recommender Systems (RecSys '10), Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI '10), pages 14–21, Barcelona, Spain, Sept. 2010.
[10] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, August 1999.
[11] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, pages 39–46, 2010.
[12] P. Cremonesi and R. Turrin. Analysis of cold-start recommendations in IPTV systems. In RecSys '09: Proceedings of the 2009 ACM Conference on Recommender Systems, pages 1–4. ACM, 2009.
[13] P. Cremonesi and R. Turrin. Time-evolution of IPTV recommender systems. In Proc. of the 8th European Conference on Interactive TV and Video, Tampere, Finland, June 2010. ACM.
[14] B. de Ville. Decision Trees for Business Intelligence and Data Mining: Using SAS Enterprise Miner. SAS Publishing, 2006.
[15] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177, 2004.
[16] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. Pages 465–480, New York, NY, USA, 1988. ACM Press.
[17] M. A. Ghazanfar and A. Prugel-Bennett. A scalable, accurate hybrid recommender system. In Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, WKDD '10, pages 94–98, Washington, DC, USA, 2010. IEEE Computer Society.
[18] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53, 2004.
[19] P. Husbands, H. Simon, and C. Ding. On the use of singular value decomposition for text retrieval. Oct. 2000.
[20] G. Karypis. Evaluation of item-based top-n recommendation algorithms. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM '01, pages 247–254, New York, NY, USA, 2001. ACM.
[21] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(3):226–239, Mar. 1998.
[22] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434, New York, NY, USA, 2008. ACM.
[23] P. Lops, M. Gemmis, and G. Semeraro. Content-based recommender systems: state of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, chapter 3, pages 73–105. Springer US, Boston, MA, 2011.
[24] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In Eighteenth National Conference on Artificial Intelligence, pages 187–192, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence.
[25] B. Mobasher, X. Jin, and Y. Zhou. Semantically enhanced collaborative filtering on the web. In Proceedings of the 1st European Web Mining Forum (EWMF 2003), pages 57–76, Sept. 2003.
[26] R. Navarro-Prieto, P. Rebaque-Rivas, and J. Hernández-Pablo. Recommending content for iTV: what the users really want? In Proceedings of the 8th International Interactive Conference on Interactive TV&Video, EuroITV '10, pages 123–126, New York, NY, USA, 2010. ACM.
[27] P. Cremonesi, F. Garzotto, S. Negro, A. Papadopoulos, and R. Turrin. Comparative evaluation of recommender system quality. In CHI Extended Abstracts on Human Factors in Computing Systems. ACM, 2011. To appear.
[28] M. J. Pazzani. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13:393–408, December 1999.
[29] G. Salton, editor. Automatic Text Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1988.
[30] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. 10th Int. Conf. on World Wide Web, pages 285–295, 2001.
[31] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. 5th Int. Conf. on Computer and Information Technology (ICCIT 2002), pages 399–404, 2002.
[32] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Generative models for cold-start recommendations. In ACM SIGIR Workshop on Recommender Systems, 2001.
[33] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 253–260, 2002.
[34] B. Smyth and P. Cotter. A personalised TV listings service for the digital TV age. Knowledge-Based Systems, 13(2-3):53–59, 2000.
[35] H. Zhang, S. Zheng, and J. Yuan. A personalized TV guide system compliant with MHP. Consumer Electronics, IEEE Transactions on, 51(2):731–737, May 2005.
[36] J. Zimmerman, K. Kurapati, A. L. Buczak, D. Schaffer, S. Gutta, and J. Martino. TV personalization system: design of a TV show recommender engine and interface (chapter 5).