=Paper=
{{Paper
|id=None
|storemode=property
|title=Hybrid Algorithms for Recommending New Items in Personal TV
|pdfUrl=https://ceur-ws.org/Vol-720/Airoldi.pdf
|volume=Vol-720
}}
==Hybrid Algorithms for Recommending New Items in Personal TV==
Fabio Airoldi (Politecnico di Milano - DEI, P.zza Leonardo da Vinci 32, Milano, Italy) - fabio.airoldi@gmail.com
Paolo Cremonesi (Politecnico di Milano - DEI, P.zza Leonardo da Vinci 32, Milano, Italy) - paolo.cremonesi@polimi.it
Roberto Turrin (Moviri - R&D, Via Schiaffino 11, Milano, Italy) - roberto.turrin@moviri.com
ABSTRACT
Recommending TV programs in the interactive TV domain is a difficult task since the catalog of available items is very dynamic, i.e., items are continuously added and removed. Although recommender systems based on collaborative filtering typically outperform content-based systems in terms of recommendation quality, they suffer from the new-item problem, i.e., they are not able to recommend items that have few or no ratings. On the contrary, content-based recommender systems are able to recommend both old and new items, but the general quality of their recommendations, in terms of relevance to the users, is low.
In this article we present two different approaches for building hybrid collaborative+content recommender systems whose purpose is to produce relevant recommendations while overcoming the new-item issue. The approaches are tested on the implicit ratings collected from 15'000 IPTV users over a period of six months.

1. INTRODUCTION
Interactive television allows providers to deliver to their customers a huge amount of digital content. Since discovering interesting items can be difficult in such wide collections, recommender systems are used to provide the users with personalized lists of items that they may like.
Recommendations are based on the user preferences, referred to as ratings, that the system has gathered either explicitly (typically on the 1...5 scale) or implicitly (typically in binary format: 1 if the user likes an item, 0 otherwise). In the interactive TV (iTV) domain ratings are often implicit [2, 13, 12], inferred by tracking the users' activity (e.g., the purchased movies or the watched TV programs).
Item catalogs in the settings of iTV are intrinsically subject to frequent modifications: new programs are usually added as soon as they are available and old content becomes no longer available. We can identify:

(i) A set of items that are repeated over time, such as television seasons, weekly talk shows, or reruns of movies. We can assume that a number of users have watched these items, thus providing implicit ratings.

(ii) A set of items that are shown for the first time, such as the first showing of a movie. We can assume that no ratings have been collected about these items.

Most recommender systems are based on collaborative filtering algorithms, i.e., they recommend items on the basis of the preferences of users similar to the target user. However, since no ratings have been collected for new programs, they cannot be recommended by collaborative recommender systems. This issue is known as the new-item problem.
As an alternative to collaborative filtering, recommender systems can implement content-based filtering algorithms. Since content-based approaches base their predictions upon the description of TV programs in terms of features - such as genre and actors - they are not influenced by the lack of ratings. Unfortunately, collaborative algorithms have been systematically proven to outperform content-based algorithms in terms of recommendation quality, measured by standard accuracy metrics (e.g., recall and precision) and error metrics (e.g., RMSE and MAE). As an example, if we consider the state-of-the-art recommender algorithms implemented in [2], in our experiments the collaborative approach reaches a recall equal to 19%, while the content-based approach does not go over 3%.
In the settings of iTV, we would like to build a recommender system that is not affected by the new-item problem but has a quality comparable to collaborative filtering. Several hybrid algorithms have been proposed in the literature, merging content-based and collaborative filtering into a unique algorithm. Some of them have even been proven to outperform base collaborative recommender systems in terms of quality. However, the proposed solutions are rarely used in production environments, mainly because of scalability issues. Furthermore, most approaches are designed to work in the case of explicit, non-binary ratings, without any particular focus on the new-item problem.
In this work we present two families of hybrid collaborative+content algorithms which are specifically designed to respect the requirements of commercial real-time recommender systems. We take into account state-of-the-art collaborative and content-based recommender algorithms typically used in the presence of implicit, binary ratings. Each hybrid solution is composed of one collaborative and one content-based algorithm.
The main idea behind the first family of hybrid algorithms is to augment the existing ratings used for training the collaborative algorithm with additional ratings estimated with the content-based algorithm. In contrast, the second family of hybrid algorithms merges the item-to-item similarities computed by the collaborative and the content-based algorithms.
As a comparison, we also implemented a state-of-the-art and a trivial hybrid algorithm. The latter simply mixes in a unique recommendation list the items recommended with
the collaborative and those recommended with the content-based algorithm.
The recommendation quality of the hybrid algorithms has been evaluated in terms of recall against the quality of baseline algorithms on a dataset implicitly collected by an IPTV provider over a period of six months. A specific testing methodology has been designed in order to evaluate the quality of recommender algorithms in the presence of new items.
In Section 2 we present an overview of the existing panorama regarding hybrid recommender systems, with particular focus on the algorithms that are most promising in the dynamic iTV scenario. In Section 3 we describe the reference state-of-the-art algorithms we included in our experiments. In Section 4 we illustrate the new algorithms that we developed. Section 5 details the dataset we used for the evaluation and the testing methodologies we adopted. Section 6 shows and discusses the results we obtained. Finally, Section 7 draws the conclusions and points to some possible future work.

2. RELATED WORK
Recently, several television operators have considered the integration of a recommender system into their architectures in order to provide personalized content (e.g., [2]).
The need for recommender systems in TV applications is motivated by the fact that users generally appreciate receiving personal suggestions generated according to their dynamically updated profiles, as shown in [26]. The same study also reported that customers prefer to be recommended items similar to the ones that they have already rated, but also items that their friends have enjoyed.
Recommender systems used to deliver personalized TV experiences are typically content-based [36], collaborative [35] or hybrid solutions [34].
In the next paragraphs we summarize the limitations of non-hybrid recommender systems and we present an overview of the most interesting existing algorithms suitable for personalized TV purposes.

2.1 Drawbacks of standard algorithms
The two main families of recommender algorithms are content-based and collaborative filtering.
The former are based on the analysis of the content of items (e.g., genre and actors). On the contrary, the latter suggest items on the basis of the preferences of users similar to the target user.
Collaborative algorithms have recently received much more attention than content-based solutions. The main reason is that they usually reach a higher recommendation quality than content-based systems [17, 8]. Furthermore, while collaborative algorithms can be easily implemented in any domain, content-based systems are much more complex because they require analyzing the items' content (e.g., parsing textual data) [23].
However, since collaborative algorithms are based on the user ratings, they have the following major drawbacks [1]:

• New-item. Collaborative filtering is particularly affected by the new-item problem, being not able to recommend items that have received few or no ratings because the system does not have enough information.

• Popularity bias. Collaborative filtering is biased towards the most popular items, i.e., items which have been rated by many users are more likely to be recommended than items that have few ratings.

2.2 Hybrid algorithms
During the last years several approaches have tried to overcome the drawbacks of single recommender approaches by combining them into new hybrid recommender algorithms, from simple implementations (e.g., [34, 25, 10, 6]) up to very complex algorithms, such as the BellKor solution winning the Netflix prize [3], which combines predictions from 107 different baseline recommender systems [4]. The idea of merging multiple predictors has often been used in the settings of machine learning to improve classification accuracy (e.g., [21]).
Burke performed an extensive survey and classification of hybrid recommender systems [6, 7].
Mixed algorithms present simultaneously the items recommended by two different recommender algorithms. As an example, Smyth and Cotter [34] show in the same interface recommendations generated by a content-based and a collaborative algorithm.
In case the confidence of a prediction is measurable, it is possible to implement a switched algorithm, which selects the best algorithm to use according to some confidence estimates. For instance, the system 'NewsDude' [5] combines a content-based nearest-neighbor recommender, a collaborative recommender, and a naïve Bayes classifier. Ratings are predicted by the algorithm which shows the highest confidence, and the user is then recommended the items that have the highest predictions.
Weighted algorithms compute a linear combination of the ratings predicted by two (or more) recommender algorithms. A similar approach has been proposed by Mobasher et al. [25], which linearly combines item-to-item similarities generated by different recommender algorithms. Differently from other hybrid solutions, this method can be used on implicit datasets (see Section 3.3 for further details).
Meta-level recommender systems use the model produced by an auxiliary algorithm for training the primary algorithm. As an example, the restaurant recommender proposed by Pazzani in [28] builds a feature-based model of the users using content-based information. This model is then used by a collaborative algorithm to compute recommendations.
Melville et al. propose a feature-augmentation algorithm denoted "Content boosted collaborative filtering" (CBCF) [24], which Burke's survey [6, 7] reports as one of the best algorithms. Their approach basically consists in creating 'augmented user profiles' by adding 'pseudo-ratings' to the original user profiles prior to generating recommendations. Pseudo-ratings are generated using a content-based naïve Bayes classifier and can be interpreted as the ratings that users would give to unrated items, given the items' features. Rating prediction is computed with a variant of the user-based collaborative approach, where user-to-user similarities are computed as the Pearson correlation coefficient between the original user profiles and the augmented user profiles, and the weights assigned to pseudo-ratings depend on the number of rated and co-rated items for each user. Melville et al. reported that their algorithm performed better than the content-based naïve Bayes and the user-based collaborative algorithms in terms of MAE (Mean Absolute Error). They also showed that their algorithm is less susceptible to problems induced by sparsity: at 99.9% sparsity their algorithm
performed exactly like the content-based baseline.
Although several hybrid algorithms have been brought to scientific attention during the last few years, there is no exhaustive literature on hybrid recommender algorithms designed to alleviate the new-item problem, the only notable exception being the work of Schein et al. [33, 32]. They describe a generative algorithm with the specific purpose of achieving high performance in cold-start situations. In their work they present a multi-aspect probabilistic model which is used to compute the probability that an item is liked or not by a user. The aspects include collaborative data as well as content-based data and one or more latent aspects. The hidden aspects are used to model the hypothesis that, as an example, a user liked a specific movie because of some particular, latent motivation. This kind of algorithm, along with association rule mining, neural networks and Boltzmann machines, even if promising, is however difficult to integrate in existing systems, as it uses a completely different approach from traditional collaborative filtering and does not properly scale with the number of users and items.
Also, hybrid algorithms are subject to a number of problems. First, many of them (e.g., weighted hybrids) need to use underlying algorithms that are able to perform rating predictions. This makes them useless in implicit domains like iTV applications: algorithms which rely on predicting ratings strictly need explicit ratings to work properly, as they are usually optimized with respect to error metrics (such as MAE or RMSE - e.g., [22]). However, the presence of only implicit, binary ratings (as in our datasets) does not allow us to use this set of algorithms, since we do not have proper ratings.
Additionally, some of the algorithms have scalability issues. As an example, CBCF [24] requires augmented user profiles in order to produce recommendations. Since it is impossible to store the augmented URM (it would be a full matrix), the profiles need to be generated in real time. However, this requires computing the pseudo-ratings for all the users in the neighborhood using a naïve Bayes classifier. Using the neighborhood size suggested by Melville et al., this would mean generating 30 recommendation lists with the Bayesian classifier just to produce one hybrid recommendation list. Other, simpler hybrid algorithms, belonging to the weighted, switching, and mixed categories, need to generate in real time a number of recommendation lists equal to the number of underlying baseline recommenders. The consequence is that even the simplest algorithm belonging to those categories (e.g., the interleaved approach described in Section 3.3) would have a halved throughput with respect to a non-hybrid solution.
The only state-of-the-art hybrid approach with a complexity comparable to non-hybrid algorithms is SimComb [25]; however, we did not find any reference of this solution applied to implicit datasets.

3. STATE-OF-THE-ART ALGORITHMS
Recommender algorithms usually collect user ratings in a user-by-item matrix, from here on referred to as User Rating Matrix (URM) and denoted by R, where r_ui is the rating given by user u to item i. The u-th row of this matrix is denoted by r_u and represents the profile of user u.
The URM is typically very sparse (users rate, on average, only a few items). In the settings of iTV we assume to have only implicit feedback, so that the URM contains binary ratings, where the values 1 and 0 indicate, respectively, whether the TV program has been watched or not.
Collaborative filtering algorithms analyze the URM to identify similarity between either users (user-based) or items (item-based), or hidden relations between users and items (latent factors). Item-based collaborative algorithms have been proven to outperform user-based approaches in terms of scalability and recommendation quality [1]. They discover item-to-item similarities using metrics such as the cosine or the adjusted cosine similarity [30]. The rating prediction for a specific item is computed by considering the ratings given by the target user to items similar to it (denoted neighbors). Finally, latent factor collaborative algorithms are typically based on singular value decomposition (SVD) to extrapolate underlying relations between users and items [31, 22].
Content-based recommender systems base recommendations on the features extracted from the items' content, such as the actors, the genre, and the summary of a movie. The typical bag-of-words approach represents items as vectors of features stored in a feature-by-item matrix, which is referred to as Item Content Matrix (ICM) and denoted by W. The items' content needs to be processed in order to identify terms (tokenizing), to remove useless words (stop words), to normalize words (stemming), and to weight each feature (weighting). As for the latter, weights are usually computed using the TF-IDF metric [29]. Users can be represented as feature vectors, composed of the features of the movies they have watched. The rating prediction can be computed as the similarity between the user and item feature vectors.
In the following section we describe the state-of-the-art recommender algorithms we have taken into consideration. They comprise two collaborative filtering algorithms, one content-based algorithm, and two hybrid algorithms.
The algorithms we have considered are designed to be used with implicit, binary datasets. Furthermore, the algorithms share the property of not requiring any user-specific parameters to be learned in advance. As a consequence, these algorithms are able to recommend to any user in real time on the basis of his/her most recent ratings.
The state-of-the-art collaborative and content-based algorithms will be used in Section 4 for defining two families of hybrid recommender algorithms, each one combining one content-based and one collaborative filtering algorithm: filtered feature augmentation and similarity injection.

3.1 Collaborative
In the following we present one item-based (NNCosNgbr) and one latent factor (PureSVD) state-of-the-art collaborative algorithm suitable for predicting top-N recommendations in the case of binary ratings.

Non-normalized cosine KNN (NNCosNgbr).
This is a state-of-the-art item-based collaborative filtering algorithm described in [20]. Note that in the case of binary ratings we cannot compute similarity metrics such as the Pearson coefficient and the adjusted cosine, thus the item-item similarity has been measured as the cosine similarity. The rating prediction for item i has been computed by summing up the similarities between item i and its neighbors, where the set of neighbors - denoted by D^k(u;i) - has been limited to the k most similar items. This approach is usually known as kNN (k-nearest neighbors). The rating prediction
is computed as:

  r̂_ui = Σ_{j ∈ D^k(u;i)} r_uj · s^CF_ij    (1)

where s^CF_ij represents the cosine similarity between items i and j, computed as¹:

  s^CF_ij = (r_i · r_j^T) / (‖r_i‖₂ · ‖r_j‖₂)    (2)

¹ x · y denotes the inner product between vectors x and y, and ‖x‖₂ denotes the Euclidean norm of vector x.

This algorithm has been proven to have good quality, as shown in [11] and [15].

PureSVD.
PureSVD is a latent factor collaborative algorithm, proven to grant high recommendation quality in terms of recall [11].
Let f denote the number of latent factors used for representing users and items. Typical values of f are in the range [50, 300]. SVD allows us to factorize the URM into three f-dimensional matrices - U, Σ, and Q - and to compute an f-rank approximation of the original URM:

  R_f = U · Σ · Q^T    (3)

where U and Q are orthonormal matrices and Σ is diagonal. User u and item i are represented by f-dimensional vectors, respectively p_u and q_i. The rating prediction is computed using the inner product:

  r̂_ui = p_u · q_i^T    (4)

If we define P = U · Σ, we can observe that, since U and Q are orthonormal, P = R · Q, and (4) can be rewritten as:

  r̂_ui = r_u · Q · q_i^T    (5)

Finally, observe that the inner product q_i · q_j^T represents the similarity between items i and j.

3.2 Content-based
Differently from collaborative filtering, content-based recommender systems rely only on content features. In the following we describe LSA [2], a well-known approach based on SVD.

Latent Semantic Analysis (LSA).
Latent Semantic Analysis [16, 19, 2] uses SVD to reduce the dimensionality of the ICM and to capture underlying relations - like synonymy - between items. The weights in the ICM are computed using the TF-IDF metric [29]. Let l be the number of factors - typically referred to as the latent size - to be used for representing the items' features. The ICM can be factorized and approximated as:

  W_l = Z · Λ · Y^T    (6)

Let us define B = YΛ. The similarity between items i and j is denoted by s^CBF_ij and can be computed as the cosine between vectors b_i and b_j:

  s^CBF_ij = (b_i · b_j^T) / (‖b_i‖₂ · ‖b_j‖₂)    (7)

and the rating prediction is computed as:

  r̂_ui = r_u · B · b_i^T    (8)

3.3 Hybrid algorithms
In the following sections we present two families of hybrid recommender systems based on one collaborative filtering and one content-based filtering algorithm. The requirement for an algorithm to be used in the second solution is that it allows deriving a similarity measure between items. Thus, all state-of-the-art collaborative and content-based algorithms presented in Sections 3.1 and 3.2 can be used in the following hybrid approaches.

Interleaved.
This algorithm is a trivial hybrid implementation that forms the recommendation list to suggest to the target user by alternating, in turn, one item predicted by the collaborative algorithm and one predicted by the content-based algorithm.

SimComb.
Mobasher et al. [25] developed a weighted hybrid algorithm in which item-to-item similarity values are computed as the linear combination of content-based and collaborative similarities:

  c_ij = α · s^CBF_ij + (1 − α) · s^CF_ij    (9)

where s^CBF_ij and s^CF_ij are computed, respectively, using (7) and (2). Finally, the rating prediction can be computed using (1), where s^CF_ij is to be replaced with c_ij. We refer to this algorithm as simComb.

4. PROPOSED ALGORITHMS
The algorithms we present in the following sections are designed to overcome some of the limitations of existing algorithms. Our first priority was to design hybrid recommender systems able to work on implicit, binary ratings and to grant good recommendation quality even when only content information is available (new items). In addition, we also focused on scalability, in order to design algorithms that can be used in real iTV scenarios.
In the next two sections we present two families of hybrid solutions: Filtered Feature Augmentation and Similarity Injection Knn.

4.1 Filtered Feature Augmentation (FFA)
Filtered Feature Augmentation is a feature augmentation algorithm mainly inspired by CBCF (Content Boosted Collaborative Filtering) [24]. Differently from Melville's solution, our approach:

1. does not need any user-specific parameter to be learned;

2. allows the use of different content-based and collaborative filtering algorithms;

3. computes item-item similarities on the basis of the ratings augmented with pseudo-ratings derived from the content-based filtering, but uses the original user profiles for predicting ratings (as opposed to computing item-item similarities using the original URM and augmenting user profiles for predicting ratings).

Figure 1 shows the learning process: the CBF stage trains the content-based algorithm and computes the pseudo-ratings for unknown rating values to be added to the original URM.
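The augmentation step just described can be sketched as follows. This is a minimal dense-matrix illustration, not the authors' implementation: the per-user score normalization and the `top_m` cap are our own illustrative choices (the threshold filter corresponds to the FFAt variant; FFAg uses a Gini-based filter instead), and all function names are hypothetical.

```python
import numpy as np

def cosine_item_similarity(item_vectors):
    # Cosine similarity between item vectors (one row per item), mirroring
    # the cosine formulas used for both the collaborative (URM columns)
    # and the content-based (ICM columns) similarities.
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for empty items
    unit = item_vectors / norms
    return unit @ unit.T

def augment_urm(urm, item_sim, threshold=0.5, top_m=10):
    # FFAt-style augmentation: estimate a content-based score for every
    # (user, item) pair, keep only the strongest scores above a fixed
    # threshold, and add them to a copy of the URM as pseudo-ratings.
    scores = urm @ item_sim                        # content-based scores
    max_per_user = scores.max(axis=1, keepdims=True)
    max_per_user[max_per_user == 0] = 1.0
    scores = scores / max_per_user                 # normalize to [0, 1] per user
    scores[urm > 0] = 0.0                          # only unrated items qualify
    aurm = urm.astype(float).copy()
    for u in range(urm.shape[0]):
        best = np.argsort(scores[u])[::-1][:top_m]
        for i in best:
            if scores[u, i] >= threshold:
                aurm[u, i] = scores[u, i]
    return aurm
```

The collaborative model is then trained on the returned augmented matrix, while predictions use the original user profiles, as stated in point 3 above.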
Figure 1: Model generation for FFA. The CBF stage comprises content-based model generation and recommendations for every item.

The augmented URM (namely, aURM) is used as input for the collaborative filtering.
Since adding all pseudo-ratings would lead to a dense, very large augmented URM, the filter selects only the most relevant pseudo-ratings. In our experiments we used two different filters: a simple one, which excludes all the pseudo-ratings lower than a fixed threshold (FFAt), and a more sophisticated one, which uses the Gini impurity measure [14] in order to add both high and low pseudo-ratings and thus increase the intrinsic information of the item profiles (FFAg). The Gini coefficient is defined as:

  Gini(v) = 1 − Σ_{x ∈ v} p_x²    (10)

where p_x is the probability of x in v. In our specific case v is an item profile (i.e., a column of the URM) and p_x = n_x / n, where n_x is the number of ratings equal to x in v and n is the number of ratings in v. When Gini(v) = 0, v is pure (and brings almost no information). As we want to add informative pseudo-ratings, the filter lets through only the pseudo-ratings that most increase the Gini index for each item. This is done until at least g pseudo-ratings have been added to each item profile. The value of g depends on the number of original ratings for each user profile (denoted by n):

  g = n_min − n + h·n        if n ≤ n_min
  g = (h · n_min²) / n       otherwise    (11)

where n_min and h are parameters. In our experiments we used the average number of ratings as the value for n_min and 0.3 for h.
The rating prediction has been computed using (1), where the r_ui are the original user ratings and s^CF_ij is the similarity between items i and j computed using (2) on the augmented URM.

4.2 Similarity Injection Knn (simInjKnn)
Similarity Injection Knn builds a model using item-item similarities obtained from one collaborative and one content-based algorithm.
We first compute the item-item similarities s^CF_ij using collaborative filtering, retaining only the k most similar items. Similarly, we compute the item-item similarities s^CBF_ij using content-based filtering.
The similarities are later merged into a unique item-item similarity matrix S by adopting a two-step process:

• all the elements s^CF_ij are copied into S. Thus, each item is filled with k similarity values deriving from the collaborative filtering;

• the elements s^CBF_ij are later inserted into the corresponding empty (i.e., zero) elements of S in such a way as to have, for each item, a set of k additional similarity measures deriving from the content-based filtering.

At the end of the construction, each item will have a total of 2k similar items:

• the k most collaborative-similar items;

• the k most content-similar items not appearing in the previous set.

Exactly half the similarities come from the content-based technique and half from the collaborative technique.
The framework allows different combinations of item-based collaborative and content-based filtering. In this work we present the results based on LSA as the content-based algorithm and NNCosNgbr as the collaborative filtering algorithm, since they reached the highest quality in terms of recall. Thus, s^CF_ij and s^CBF_ij have been computed using (2) and (7), respectively. Finally, the rating prediction is estimated with (1), where s^CF_ij is replaced by s_ij.

5. DATASET AND EVALUATION METHODOLOGY
We evaluated our algorithms using a variant of the methodology described in [22, 11], designed to evaluate the performance of recommender algorithms with a focus on the new-item problem.
We used the dataset provided by an IPTV provider². It contains 25765 binary ratings implicitly collected over a period of six months, given by 15563 users on 794 items (0.0021 sparsity). Content-based features comprise: actors, directors, category (e.g., movie, series, documentary...), title, genres, summary and language, for a total of 11291 features and an average of 18 features per item. The main characteristics of the available features are summarized in Table 1.

Table 1: Dataset characteristics: number of different values for each type of feature (a) and number of items for each language (b).

(a) Types of features
Type        #values
actors      693
categories  8
directors   88
genres      16
languages   3

(b) Languages
Language    #items
German      703
French      84
Italian     7

² Ratings refer to the dataset TV2 available at http://home.dei.polimi.it/cremones/memo/

The evaluation of recommender systems is typically performed by using either error metrics or accuracy metrics
[18]. Since error metrics - such as MAE (mean absolute
error) and RMSE (root mean square error) - rely on com-
puting the error between actual and predicted ratings, they
cannot be measured on implicit, binary datasets where this
information is not available [20, 15].
For such reason we focus on accuracy metrics, that esti-
mate the fraction of relevant items which are actually recom-
mended (recall) or the fraction of the recommended items
that are actually relevant (precision). In addition, recent
works [11, 27] consider accuracy metrics as more suitable,
with respect to error metrics, for evaluating the top-N rec-
ommendation task [18], i.e., the capability of the recom-
mender system to suggest very limited lists of items that
are likely to be of interest for the users. Figure 2: Sliding window. Blank circles represent
The standard definition of recall - which is typically used ratings in the set T , solid circles represent ratings in
in the settings of information retrieval - is: the test set, and x’s represent ratings in the training
set.
|relevant ∧ retrieved|
recall = (12)
|relevant|
Let M denote the number of items in the dataset. We
Usually, in the settings of recommender systems, the set of randomly divide items in two sets, H1 and H2 , each one
relevant items is composed by items positively rated (e.g., composed by M/2 items. Test set and training set are de-
if the rating is 5 out of 5). However, since we are facing fined as a function of the percentage parameters β as follows:
with an implicit, binary dataset, in our evaluation we have
considered - analogously to other works such as [15, 12, 13] - all rated items to be relevant, as we do not have any further information about the degree of user satisfaction.

5.1 Performance on new items
In order to specifically evaluate the impact of new items on the quality of the different algorithms, we developed a testing methodology, referred to as sliding window, which extends the evaluation methodology presented in [22, 11].

The original approach evaluates the quality of recommender algorithms by measuring recall as a function of the number of items displayed to the user (N). The test consists in excluding a certain amount of ratings (the test set) and using the remaining ratings to train the algorithms (the training set). All available content-based features are used for training the content-based and hybrid algorithms. Once an algorithm has been trained with the ratings in the training set, each rating rui in the test set is tested as follows:

• we predict the score for all items unrated by user u;

• we select the top-N items according to the estimated score;

• if item i appears in the top-N recommendation list, we have a 'hit', i.e., the system has correctly suggested a relevant item.

With respect to (12), the set of relevant items corresponds to the test set, while the set of relevant, retrieved items corresponds to the number of hits. Thus, recall can be rewritten as a function of N, i.e., the number of items displayed to users:

    recall(N) = #hits / |test-set|        (13)

Because of the high dataset sparsity, we have formed the test set by randomly selecting 20% of the ratings, in order to have a significant number of samples.

This evaluation methodology is not able to measure the quality of the recommendations on the new-item problem. Therefore we have implemented some modifications:

• we define a set T by randomly selecting 20% of the ratings in the URM;

• the training set is defined as the set of ratings related to items in H1, excluding all ratings in T;

• we form a set H1+2 composed of M/2 items, (100 − β)% randomly extracted from set H1 and β% randomly extracted from set H2;

• the test set is composed of the ratings in T related to items in H1+2.

Figure 2 schematically shows how the different sets are formed. Blank circles, solid circles, and x's refer to the ratings in the set T, in the test set, and in the training set, respectively.

For each value of β we have composed a training and a test set and we have computed the quality of the recommender algorithm in terms of recall, as defined in (13), where N has been set equal to 20. The test is the same as the one described for the original evaluation methodology. For each rating rui in the test set: (i) we predict the score for all items unrated by user u, (ii) we select the top-20 items according to the estimated score, and (iii) we verify if there has been a hit, i.e., if item i appears in the top-20.

Note that only ratings related to half the items are available for training the algorithms. The parameter β specifies the percentage of new items, i.e., the percentage of items that have no ratings in the training set and thus cannot be recommended by standard collaborative filtering. In our experiments we varied β from 0% to 100%.

Collaborative filtering is expected to recommend only items in H1, so the higher β, the lower the quality of the algorithm. When β = 100% we expect the quality of any collaborative filtering algorithm to be 0. On the other hand, content-based algorithms are trained exclusively with content-based features, and are thus totally independent of the ratings included in the training set: we expect their quality not to be influenced by β. Finally, hybrid approaches can be trained with the ratings related to half the items and with all available content-based features.
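The sliding-window protocol can be sketched as follows. This is a minimal, illustrative implementation under simplifying assumptions: the helper names (`sliding_window_split`, `rank_items`), the representation of the URM as a list of (user, item) pairs, and the uniform random split of the catalog into H1 and H2 are our own choices, not the authors' actual code; β is expressed as a fraction in [0, 1] rather than a percentage.

```python
import random

def sliding_window_split(ratings, items, beta, test_frac=0.2, seed=0):
    """Build training and test sets for a given beta (fraction of new items).

    ratings: list of (user, item) pairs (the URM, binary ratings)
    items:   list of item ids (the catalog, M items)
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    h1, h2 = set(shuffled[:half]), set(shuffled[half:])  # rated vs. new items

    # T: 20% of the ratings, drawn at random from the URM
    t_idx = set(rng.sample(range(len(ratings)), int(test_frac * len(ratings))))
    t = [r for k, r in enumerate(ratings) if k in t_idx]

    # training set: ratings on H1 items, excluding all ratings in T
    train = [r for k, r in enumerate(ratings) if k not in t_idx and r[1] in h1]

    # H1+2: M/2 items, beta from H2 (new) and (1 - beta) from H1 (old)
    n_new = round(beta * half)
    h12 = set(rng.sample(sorted(h2), n_new)) | set(rng.sample(sorted(h1), half - n_new))

    # test set: ratings in T related to items in H1+2
    test = [r for r in t if r[1] in h12]
    return train, test

def recall_at_n(test, rank_items, n=20):
    """Hit-based recall, as in Eq. (13): a test rating (u, i) is a hit
    when item i appears in the top-n list returned by rank_items(u)."""
    if not test:
        return 0.0
    hits = sum(1 for u, i in test if i in rank_items(u)[:n])
    return hits / len(test)
```

In a real run, `rank_items` would rank the items unrated by user u according to the scores estimated by the recommender trained on `train`; the sketch leaves the scoring model abstract.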
6. RESULTS AND DISCUSSION
In our test we considered the state-of-the-art recommender algorithms described in Section 3 and the hybrid approaches proposed in Section 4. As for the former, we included two item-based collaborative algorithms - PureSVD and non-normalized cosine (NNCosNgbr) - a content-based algorithm - Latent Semantic Analysis (LSA) - and two hybrid algorithms - interleaved and simComb. As for the latter, we included simInjKnn and two variants of filtered feature augmentation, referred to as FFAg and FFAt. FFAg uses a filter based on Gini's impurity measure, while FFAt uses a filter based on a fixed threshold, set to 1, thus excluding all pseudo-ratings lower than 1 (the number of pseudo-ratings added to the URM was 27.5% of the original number of ratings).

The latent size l of LSA has been set to 300, while the number of features f in PureSVD is equal to 50. The neighborhood size k has been set to 200 for NNCosNgbr. As for simComb, the coefficient α has been set to 0.3, as it empirically provides the best results. Finally, the neighborhood size k used for simInjKnn is 100.

Figure 3 shows the sliding window results, plotting the recall of the state-of-the-art (a) and the proposed hybrid (b) algorithms as a function of β, i.e., the percentage of new items. The recall has been computed assuming N = 20.

The most noticeable result is the very poor quality of the content-based algorithm, whose recall does not exceed 3%, much lower than the best collaborative approach - NNCosNgbr - whose recall is about 25% when recommending only old items. In addition, we can observe that the three hybrid algorithms - simComb, FFAt, and simInjKnn - have a recall higher than the collaborative and content-based state-of-the-art solutions. As for the other two hybrid algorithms: the interleaved approach confirms its poor quality, being based on a very trivial technique, while the lower quality of FFAg is explained by the fact that binary ratings do not carry enough information for the filter to work properly. However, the results of the sliding window test show that FFAt and simInjKnn outperform the state-of-the-art algorithm simComb when more than 40% of the items are new. In addition, we can notice that recall drops drastically when β = 100%, with the exception of the content-based algorithm, which is not influenced by the presence of unrated items.

As a final observation, let us consider the performance implications of the proposed hybrid solutions. Similarity Injection Knn (simInjKnn) requires building two item-to-item similarity matrices, which is the same effort as required by the interleaved algorithm. Filtered Feature Augmentation (FFAg and FFAt), on the other hand, requires computing pseudo-ratings for each item. Depending on the number of items and users, this can be computationally costly. The main benefit of Filtered Feature Augmentation is that it is completely independent of the collaborative and content-based algorithms used, as opposed to Similarity Injection Knn, which strictly requires algorithms able to express item-item similarities.

7. CONCLUSIONS AND FUTURE WORKS
The new hybrid algorithms we proposed have been shown to improve the overall performance of the state-of-the-art algorithms in suggesting new items, though their performance is limited by the poor quality of content-based algorithms.

In future work we plan to implement between-subjects controlled experiments for a subjective evaluation of the proposed solutions. In fact, recent works have pointed out that objective evaluation of recommender systems (i.e., based on error and accuracy metrics) is not always aligned with the actual user perception of recommendation quality, as measured via controlled experiments (e.g., [9, 27]).

8. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734-749, 2005.
[2] R. Bambini, P. Cremonesi, and R. Turrin. A recommender system for an IPTV service provider: a real large-scale production environment. In Recommender Systems Handbook. Springer, to appear, 2010.
[3] R. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize.
[4] R. M. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 43-52, Oct. 2007.
[5] D. Billsus and M. J. Pazzani. User modeling for adaptive news access. User Modeling and User-Adapted Interaction, 10:147-180, February 2000.
[6] R. Burke. Hybrid recommender systems: survey and experiments. User Modeling and User-Adapted Interaction, 12:331-370, November 2002. Kluwer Academic Publishers, Hingham, MA, USA.
[7] R. Burke. Hybrid web recommender systems. In The Adaptive Web, pages 377-408. Springer-Verlag, Berlin, Heidelberg, 2007.
[8] L. Candillier, K. Jack, F. Fessant, and F. Meyer. State-of-the-art recommender systems. IGI Global, pages 1-22, 2009.
[9] L. Chen and P. Pu. A user-centric evaluation framework of recommender systems. In ACM Conference on Recommender Systems (RecSys '10), Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI '10), pages 14-21, Barcelona, Spain, Sept. 2010.
[10] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, August 1999.
[11] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, pages 39-46, 2010.
[12] P. Cremonesi and R. Turrin. Analysis of cold-start recommendations in IPTV systems. In RecSys '09: Proceedings of the 2009 ACM Conference on Recommender Systems, pages 1-4. ACM, 2009.
[13] P. Cremonesi and R. Turrin. Time-evolution of IPTV recommender systems. In Proc. of the 8th European Conference on Interactive TV and Video, Tampere, Finland, June 2010. ACM.
[Figure 3 about here: two plots of Recall(20) as a function of the percentage of new items (β), from 0% to 100%, with recall ranging from 0 to 0.3. Panel (a): interleaved, LSA300, NNCosNgbr, simComb, pureSVD. Panel (b): FFAg, FFAt, simInjKnn.]

Figure 3: Results of the sliding window test. Figure (a) refers to state-of-the-art approaches, while Figure (b) to the proposed hybrid solutions.
[14] B. de Ville. Decision Trees for Business Intelligence and Data Mining: Using SAS Enterprise Miner. SAS Publishing, 2006.
[15] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143-177, 2004.
[16] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. pages 465-480, New York, NY, USA, 1988. ACM Press.
[17] M. A. Ghazanfar and A. Prugel-Bennett. A scalable, accurate hybrid recommender system. In Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, WKDD '10, pages 94-98, Washington, DC, USA, 2010. IEEE Computer Society.
[18] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5-53, 2004.
[19] P. Husbands, H. Simon, and C. Ding. On the use of singular value decomposition for text retrieval. Oct. 2000.
[20] G. Karypis. Evaluation of item-based top-n recommendation algorithms. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM '01, pages 247-254, New York, NY, USA, 2001. ACM.
[21] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239, March 1998.
[22] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426-434, New York, NY, USA, 2008. ACM.
[23] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: state of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, chapter 3, pages 73-105. Springer US, Boston, MA, 2011.
[24] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In Eighteenth National Conference on Artificial Intelligence, pages 187-192, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence.
[25] B. Mobasher, X. Jin, and Y. Zhou. Semantically enhanced collaborative filtering on the web. In Proceedings of the 1st European Web Mining Forum (EWMF 2003), pages 57-76, Sept. 2003.
[26] R. Navarro-Prieto, P. Rebaque-Rivas, and J. Hernández-Pablo. Recommending content for iTV: what the users really want? In Proceedings of the 8th International Interactive Conference on Interactive TV & Video, EuroITV '10, pages 123-126, New York, NY, USA, 2010. ACM.
[27] P. Cremonesi, F. Garzotto, S. Negro, A. Papadopoulos, and R. Turrin. Comparative evaluation of recommender system quality. In CHI Extended Abstracts on Human Factors in Computing Systems. ACM, to appear, 2011.
[28] M. J. Pazzani. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13:393-408, December 1999.
[29] G. Salton, editor. Automatic Text Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1988.
[30] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In 10th Int. Conf. on World Wide Web, pages 285-295, 2001.
[31] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. In 5th Int. Conf. on Computer and Information Technology (ICCIT 2002), pages 399-404, 2002.
[32] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Generative models for cold-start recommendations. In ACM SIGIR Workshop on Recommender Systems, 2001.
[33] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 253-260, 2002.
[34] B. Smyth and P. Cotter. A personalised TV listings service for the digital TV age. Knowledge-Based Systems, 13(2-3):53-59, 2000.
[35] H. Zhang, S. Zheng, and J. Yuan. A personalized TV guide system compliant with MHP. IEEE Transactions on Consumer Electronics, 51(2):731-737, May 2005.
[36] J. Zimmerman, K. Kurapati, A. L. Buczak, D. Schaffer, S. Gutta, and J. Martino. TV personalization system: design of a TV show recommender engine and interface. Chapter 5.