Hybrid algorithms for recommending new items in personal TV

Fabio Airoldi, Politecnico di Milano - DEI, P.zza Leonardo da Vinci, 32, Milano, Italy (fabio.airoldi@gmail.com)
Paolo Cremonesi, Politecnico di Milano - DEI, P.zza Leonardo da Vinci, 32, Milano, Italy (paolo.cremonesi@polimi.it)
Roberto Turrin, Moviri - R&D, Via Schiaffino, 11, Milano, Italy (roberto.turrin@moviri.com)

ABSTRACT
Recommending TV programs in the interactive TV domain is a difficult task since the catalog of available items is very dynamic, i.e., items are continuously added and removed. Although recommender systems based on collaborative filtering typically outperform content-based systems in terms of recommendation quality, they suffer from the new-item problem, i.e., they are not able to recommend items that have few or no ratings. On the contrary, content-based recommender systems are able to recommend both old and new items, but the general quality of their recommendations, in terms of relevance to the users, is low.
In this article we present two different approaches for building hybrid collaborative+content recommender systems whose purpose is to produce relevant recommendations while overcoming the new-item issue. The approaches are tested on the implicit ratings collected from 15,000 IPTV users over a period of six months.

1. INTRODUCTION
Interactive television allows providers to deliver to their customers a huge amount of digital content. Since discovering interesting items can be difficult in such wide collections, recommender systems are used to provide the users with personalized lists of items that they may like.
Recommendations are based on the user preferences, referred to as ratings, that the system has gathered either explicitly (typically on a 1...5 scale) or implicitly (typically in binary format: 1 if the user likes an item, 0 otherwise). In the interactive TV (iTV) domain ratings are often implicit [2, 13, 12], inferred by tracking the users' activity (e.g., the purchased movies or the watched TV programs).
Item catalogs in the settings of iTV are intrinsically subject to frequent modifications: new programs are usually added as soon as they are available and old content becomes no longer available. We can identify:

(i) A set of items that are repeated over time, such as television seasons, weekly talk shows, or reruns of movies. We can assume that a number of users have watched these items, thus providing implicit ratings.

(ii) A set of items that are shown for the first time, such as the first showing of a movie. We can assume that no ratings have been collected about these items.

Most recommender systems are based on collaborative filtering algorithms, i.e., they recommend items on the basis of the preferences of users similar to the target user. However, since no ratings have been collected for new programs, such programs cannot be recommended by collaborative recommender systems. This issue is known as the new-item problem.
As an alternative to collaborative filtering, recommender systems can implement content-based filtering algorithms. Since content-based approaches base their predictions upon the description of TV programs in terms of features - such as genre and actors - they are not influenced by the lack of ratings. Unfortunately, collaborative algorithms have been systematically proven to outperform content-based algorithms in terms of recommendation quality, measured by standard accuracy metrics (e.g., recall and precision) and error metrics (e.g., RMSE and MAE). As an example, if we consider the state-of-the-art recommender algorithms implemented in [2], in our experiments the collaborative approach reaches a recall equal to 19%, while the content-based approach does not go over 3%.
In the settings of iTV, we would like to build a recommender system not affected by the new-item problem, but with a quality comparable to collaborative filtering. Several hybrid algorithms merging content-based and collaborative filtering into a unique algorithm have been proposed in the literature. Some of them have even been proven to outperform base collaborative recommender systems in terms of quality. However, the proposed solutions are rarely used in production environments, mainly because of scalability issues. Furthermore, most approaches are designed to work in the case of explicit, non-binary ratings, without any particular focus on the new-item problem.
In this work we present two families of hybrid collaborative+content algorithms which are specifically designed to respect the requirements of commercial real-time recommender systems. We take into account state-of-the-art collaborative and content-based recommender algorithms typically used in the presence of implicit, binary ratings. Each hybrid solution is composed of one collaborative and one content-based algorithm.
The main idea behind the first family of hybrid algorithms is to augment the existing ratings used for training the collaborative algorithm with additional ratings estimated with the content-based algorithm. Conversely, the second family of hybrid algorithms merges together the item-to-item similarities computed by the collaborative and the content-based algorithms.
As a comparison, we also implemented a state-of-the-art and a trivial hybrid algorithm. The latter simply mixes in a unique recommendation list the items recommended with the collaborative algorithm and those recommended with the content-based algorithm.
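The trivial mixing of two recommendation lists just mentioned can be sketched in a few lines. The following illustrative Python is ours, not code from the paper; the deduplication policy is an assumption (the lists are assumed to have equal length):

```python
def interleave(cf_list, cbf_list, n=10):
    """Mix two ranked recommendation lists by alternating items,
    skipping duplicates, and keeping the first n results."""
    merged, seen = [], set()
    for cf_item, cbf_item in zip(cf_list, cbf_list):
        for item in (cf_item, cbf_item):
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged[:n]

# An item recommended by both algorithms appears only once.
mixed = interleave(["a", "b", "c"], ["x", "a", "y"], n=4)
```

A production system would additionally handle lists of different lengths and items already seen by the user.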
The recommendation quality of the hybrid algorithms has been evaluated in terms of recall against the quality of baseline algorithms on a dataset implicitly collected by an IPTV provider over a period of six months. A specific testing methodology has been designed in order to evaluate the quality of recommender algorithms in the presence of new items.
In Section 2 we present an overview of the existing work on hybrid recommender systems, with particular focus on the algorithms that are most promising in the dynamic iTV scenario. In Section 3 we describe the reference state-of-the-art algorithms we included in our experiments. In Section 4 we illustrate the new algorithms that we developed. Section 5 details the dataset we used for the evaluation and the testing methodologies we adopted. Section 6 shows and discusses the results we obtained. Finally, Section 7 draws the conclusions and outlines some possible future work.

2. RELATED WORK
Recently, several television operators have considered the integration of a recommender system into their architectures in order to provide personalized content (e.g., [2]).
The need for recommender systems in TV applications is motivated by the fact that users generally appreciate receiving personal suggestions generated according to their dynamically updated profiles, as shown in [26]. The same study also reported that customers prefer to be recommended items similar to the ones that they have already rated, but also items that their friends have enjoyed.
Recommender systems used to deliver personalized TV experiences are typically content-based [36], collaborative [35] or hybrid solutions [34]. In the next paragraphs we summarize the limitations of non-hybrid recommender systems and we present an overview of the most interesting existing algorithms suitable for personalized TV purposes.

2.1 Drawbacks of standard algorithms
The two main families of recommender algorithms are content-based and collaborative filtering. The former are based on the analysis of the content of items (e.g., genre and actors). On the contrary, the latter suggest items on the basis of the preferences of users similar to the target user.
Collaborative algorithms have recently received much more attention than content-based solutions. The main reason is that they usually reach a higher recommendation quality than content-based systems [17, 8]. Furthermore, while collaborative algorithms can be easily implemented in any domain, content-based systems are much more complex because they require analyzing the items' content (e.g., parsing textual data) [23].
However, since collaborative algorithms are based on the user ratings, they have the following major drawbacks [1]:

- New-item. Collaborative filtering is particularly affected by the new-item problem, being not able to recommend items that have received few or no ratings, because the system does not have enough information.

- Popularity bias. Collaborative filtering is biased towards the most popular items, i.e., items which have been rated by many users are more likely to be recommended than items that have few ratings.

2.2 Hybrid algorithms
During the last years several approaches have tried to overcome the drawbacks of single recommender approaches by combining them into new hybrid recommender algorithms, from simple implementations (e.g., [34, 25, 10, 6]) up to very complex algorithms, such as the BellKor solution winning the Netflix prize [3], which combines predictions from 107 different baseline recommender systems [4]. The idea of merging multiple predictors has often been used in the settings of machine learning to improve classification accuracy (e.g., [21]).
Burke performed an extensive survey and classification of hybrid recommender systems [6, 7].
Mixed algorithms present simultaneously the items recommended by two different recommender algorithms. As an example, Smyth and Cotter [34] show in the same interface recommendations generated by a content-based and a collaborative algorithm.
In case the confidence of a prediction is measurable, it is possible to implement a switched algorithm, which selects the best algorithm to use according to some confidence estimates. For instance, the system 'NewsDude' [5] combines a content-based nearest-neighbor recommender, a collaborative recommender, and a naïve Bayes classifier. Ratings are predicted by the algorithm which shows the highest confidence, and then the user is recommended the items that have the highest predictions.
Weighted algorithms compute a linear combination of the ratings predicted by two (or more) recommender algorithms. A similar approach has been proposed by Mobasher et al. [25], which linearly combines item-to-item similarities generated by different recommender algorithms. Differently from other hybrid solutions, this method can be used on implicit datasets (see Section 3.3 for further details).
Meta-level recommender systems use the model produced by an auxiliary algorithm for training the primary algorithm. As an example, the restaurant recommender proposed by Pazzani in [28] builds a feature-based model of the users using content-based information. This model is then used by a collaborative algorithm to compute recommendations.
Melville et al. propose a feature-augmentation algorithm denoted "Content boosted collaborative filtering" (CBCF) [24], which Burke's survey [6, 7] reports as one of the best algorithms. Their approach basically consists in creating 'augmented user profiles' by adding 'pseudo-ratings' to the original user profiles prior to generating recommendations. Pseudo-ratings are generated using a content-based naïve Bayes classifier and can be interpreted as the ratings that users would give to unrated items, given the items' features. Rating prediction is computed with a variant of the user-based collaborative approach, where user-to-user similarities are computed as the Pearson correlation coefficient between the original user profiles and the augmented user profiles, and the weights assigned to pseudo-ratings depend on the number of rated and co-rated items for each user. Melville et al. reported that their algorithm performed better than the content-based naïve Bayes and the user-based collaborative algorithms in terms of MAE (Mean Absolute Error). They also showed that their algorithm is less susceptible to problems induced by sparsity: at 99.9% sparsity their algorithm performed exactly like the content-based baseline.
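As a rough illustration of the pseudo-rating idea, an augmented rating matrix can be obtained by filling unrated cells with content-based predictions. The sketch below is ours and deliberately simplified: it thresholds precomputed content-based scores instead of using Melville et al.'s naïve Bayes classifier and per-user weights.

```python
import numpy as np

def augment_profiles(urm, content_scores, threshold=0.5):
    """Fill unrated cells (0) of the user-by-item matrix with
    content-based pseudo-ratings above a threshold; original
    ratings are kept untouched. The threshold is our own
    illustrative assumption."""
    pseudo = np.where((urm == 0) & (content_scores >= threshold),
                      content_scores, 0.0)
    return urm + pseudo

urm = np.array([[1.0, 0.0],
                [0.0, 1.0]])
cb_scores = np.array([[0.9, 0.8],
                      [0.2, 0.9]])
aurm = augment_profiles(urm, cb_scores)
```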
Although several hybrid algorithms have been brought to the scientific attention during the last few years, there is no exhaustive literature on hybrid recommender algorithms studied to alleviate the new-item problem, with the only notable exception being the work of Schein et al. [33, 32]. They describe a generative algorithm with the specific purpose of performing well in cold-start situations. In their work they present a multi-aspect probabilistic model which is used to compute the probability that an item is liked or not by a user. The aspects include collaborative data as well as content-based data and one or more latent aspects. The hidden aspects are used to model the hypothesis that, as an example, a user liked a specific movie because of some particular, latent motivation. These kinds of algorithms, along with association rule mining, neural networks and Boltzmann machines, even if promising, are however difficult to integrate in existing systems, as they use a completely different approach from traditional collaborative filtering and do not properly scale with the number of users and items.
Also, hybrid algorithms are subject to a series of problems. First, many of them (e.g., weighted hybrids) need to use underlying algorithms that are able to perform rating predictions. This makes them useless on implicit domains like iTV applications: algorithms which rely on predicting ratings strictly need explicit ratings to work properly, as they are usually optimized with respect to error metrics (such as MAE or RMSE - e.g., [22]). The presence of only implicit, binary ratings (as in our datasets) does not allow the use of this set of algorithms, since we do not have proper ratings.
Additionally, some of the algorithms have scalability issues. As an example, CBCF [24] requires augmented user profiles in order to produce recommendations. Since it is impossible to store the augmented URM (it would be a full matrix), the profiles need to be generated in real time. However, this requires computing the pseudo-ratings for all the users in the neighborhood using a naïve Bayes classifier. Using the neighborhood size suggested by Melville et al., this would mean generating 30 recommendation lists with the Bayesian classifier just to produce one hybrid recommendation list.
Other, simpler hybrid algorithms belonging to the weighted, switching, and mixed categories need to generate in real time a number of recommendation lists equal to the number of underlying baseline recommenders. The consequence is that even the simplest algorithm belonging to those categories (e.g., the interleaved approach described in Section 3.3) would have a halved throughput with respect to a non-hybrid solution.
The only state-of-the-art hybrid approach with a complexity comparable to non-hybrid algorithms is SimComb [25]; however, we did not find any reference to this solution applied to implicit datasets.

3. STATE-OF-THE-ART ALGORITHMS
Recommender algorithms usually collect user ratings in a user-by-item matrix, from here on referred to as User Rating Matrix (URM) and denoted by R, where r_ui is the rating given by user u to item i. The u-th row of this matrix is denoted by r_u and represents the profile of user u.
The URM is typically very sparse (users rate, on average, only a few items). In the settings of iTV we assume to have only implicit feedback, so that the URM contains binary ratings, where the values 1 and 0 indicate, respectively, whether the TV program has been watched or not.
Collaborative filtering algorithms analyze the URM to identify similarities between either users (user-based) or items (item-based), or hidden relations between users and items (latent factors). Item-based collaborative algorithms have been proved to outperform user-based approaches in terms of scalability and recommendation quality [1]. They discover item-to-item similarities using metrics such as the cosine or the adjusted cosine similarity [30]. Rating prediction for a specific item is computed by considering the ratings given by the target user to items similar to it (denoted neighbors). Finally, latent-factor collaborative algorithms are typically based on singular value decomposition (SVD) to extrapolate underlying relations between users and items [31, 22].
Content-based recommender systems base recommendations on the features extracted from the items' content, such as the actors, the genre, and the summary of a movie. The typical bag-of-words approach represents items as vectors of features stored in a feature-by-item matrix, which is referred to as Item Content Matrix (ICM) and denoted by W. Items' content requires to be processed in order to identify terms (tokenizing), to remove useless words (stop words), to normalize words (stemming), and to weight each feature (weighting). As for the latter, weights are usually computed using the TF-IDF metric [29]. Users can be represented as feature vectors, composed of the features of the movies they have watched. Rating prediction can be computed as the similarity between user and item feature vectors.
In the following sections we describe the state-of-the-art recommender algorithms we have taken into consideration. They comprise two collaborative filtering algorithms, one content-based algorithm, and two hybrid algorithms.
The algorithms we have considered are designed to be used with implicit, binary datasets. Furthermore, the algorithms share the property of not requiring any user-specific parameters to be learned in advance. As a consequence, these algorithms are able to serve any user in real time on the basis of his/her most recent ratings.
The state-of-the-art collaborative and content-based algorithms will be used in Section 4 for defining two families of hybrid recommender algorithms, each one combining one content-based and one collaborative filtering algorithm: filtered feature augmentation and similarity injection.

3.1 Collaborative
In the following we present one item-based (NNCosNgbr) and one latent-factor (PureSVD) state-of-the-art collaborative algorithm suitable for predicting top-N recommendations in the case of binary ratings.

Non-normalized cosine KNN (NNCosNgbr).
This is a state-of-the-art item-based collaborative filtering algorithm described in [20]. Note that in the case of binary ratings we cannot compute similarity metrics such as the Pearson coefficient and the adjusted cosine, thus the item-item similarity has been measured as the cosine similarity. Rating prediction for item i has been computed by summing up the similarities between item i and its neighbors, where the set of neighbors - denoted by D_k(u;i) - has been limited to the k most similar items. This approach is usually known as knn (k nearest neighbors). Rating prediction is computed as:

    r̂_ui = Σ_{j ∈ D_k(u;i)} r_uj · s^CF_ij    (1)

where s^CF_ij represents the cosine similarity between items i and j, computed as:

    s^CF_ij = (r_i · r_j⊺) / (‖r_i‖₂ · ‖r_j‖₂)    (2)

where x · y denotes the inner product between vectors x and y, and ‖x‖₂ denotes the Euclidean norm of vector x. This algorithm has been proved to have good quality, as shown in [11] and [15].
PureSVD.
PureSVD is a latent-factor collaborative algorithm, proved to grant high recommendation quality in terms of recall [11]. Let f denote the number of latent factors used for representing users and items. Typical values of f are in the range [50, 300]. SVD allows to factorize the URM into three f-dimensional matrices - U, Σ, and Q - and to compute an f-rank approximation of the original URM:

    R_f = U · Σ · Q⊺    (3)

where U and Q are orthonormal matrices and Σ is diagonal. User u and item i are represented by f-dimensional vectors, respectively: p_u and q_i. Rating prediction is computed using the inner product:

    r̂_ui = p_u · q_i⊺    (4)

If we define P = U · Σ, we can observe that, since U and Q are orthonormal, P = R · Q, and (4) can be rewritten as:

    r̂_ui = r_u · Q · q_i⊺    (5)

Finally, observe that the inner product q_i · q_j⊺ represents the similarity between items i and j.

3.2 Content-based
Differently from collaborative filtering, content-based recommender systems rely only on content features. In the following we describe LSA [2], a well-known approach based on SVD.

Latent Semantic Analysis (LSA).
Latent Semantic Analysis [16, 19, 2] uses SVD to reduce the dimensionality of the ICM and to capture underlying relations - like synonymy - between items. The weights in the ICM are computed using the TF-IDF metric [29]. Let l be the number of factors - typically referred to as latent size - to be used for representing items' features. The ICM can be factorized and approximated as:

    W_l = Z · Λ · Y⊺    (6)

Let us define B = Y · Λ. The similarity between items i and j is denoted by s^CBF_ij and can be computed as the cosine between vectors b_i and b_j:

    s^CBF_ij = (b_i · b_j⊺) / (‖b_i‖₂ · ‖b_j‖₂)    (7)

and rating prediction is computed as:

    r̂_ui = r_u · B · b_i⊺    (8)

3.3 Hybrid algorithms
In the following sections we present two families of hybrid recommender systems, each based on one collaborative filtering and one content-based filtering algorithm. The requirement for an algorithm to be used in the second solution is that it allows deriving a similarity measure between items. Thus, all state-of-the-art collaborative and content-based algorithms presented in Sections 3.1 and 3.2 can be used in the following hybrid approaches.

Interleaved.
This algorithm is a trivial hybrid implementation that forms the recommendation list to suggest to the target user by alternating, in turn, one item predicted by the collaborative algorithm and one predicted by the content-based algorithm.

SimComb.
Mobasher et al. [25] developed a weighted hybrid algorithm in which item-to-item similarity values are computed as the linear combination of content-based and collaborative similarities:

    c_ij = α · s^CBF_ij + (1 − α) · s^CF_ij    (9)

where s^CBF_ij and s^CF_ij are computed using (7) and (2), respectively. Finally, rating prediction can be computed using (1), where s^CF_ij is to be replaced with c_ij. We refer to this algorithm as simComb.
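The simComb blend of Eq. (9) reduces to one line; the sketch below is our own illustrative code, with α = 0.3 as used later in Section 6:

```python
import numpy as np

def sim_comb(s_cbf, s_cf, alpha=0.3):
    """Eq. (9): c_ij = alpha * s_cbf_ij + (1 - alpha) * s_cf_ij.
    The blended matrix then replaces s_cf in the kNN prediction
    of Eq. (1)."""
    return alpha * s_cbf + (1.0 - alpha) * s_cf

s_cbf = np.array([[0.0, 1.0], [1.0, 0.0]])
s_cf = np.array([[0.0, 0.5], [0.5, 0.0]])
C = sim_comb(s_cbf, s_cf)   # C[0, 1] = 0.3*1.0 + 0.7*0.5 = 0.65
```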
4. PROPOSED ALGORITHMS
The algorithms we present in the following sections are designed to overcome some of the limitations of existing algorithms. Our first priority was to design hybrid recommender systems able to work on implicit, binary ratings and to grant good recommendation quality even when only content information is available (new items). In addition, we also focused on scalability, in order to design algorithms that can be used in real iTV scenarios.
In the next two sections we present two families of hybrid solutions: Filtered Feature Augmentation and Similarity Injection Knn.

4.1 Filtered Feature Augmentation (FFA)
Filtered Feature Augmentation is a feature augmentation algorithm mainly inspired by CBCF (Content Boosted Collaborative Filtering) [24]. Differently from Melville's solution, our approach:

1. does not need any user-specific parameter to be learned;

2. allows the use of different content-based and collaborative filtering algorithms;

3. computes item-item similarities on the basis of the ratings augmented with pseudo-ratings derived from the content-based filtering, but uses the original user profiles for predicting ratings (as opposed to computing item-item similarities using the original URM and augmenting user profiles for predicting ratings).

Figure 1 shows the learning process: the CBF stage trains the content-based algorithm and computes the pseudo-ratings for unknown rating values to be added to the original URM. The augmented URM (namely, aURM) is used as input for the collaborative filtering.

Figure 1: Model generation for FFA. The CBF stage comprises content-based model generation and recommendations for every item.

Since adding all pseudo-ratings would lead to a dense, very large augmented URM, the filter selects only the most relevant pseudo-ratings. In our experiments we used two different filters: a simple one which excludes all the pseudo-ratings lower than a fixed threshold (FFAt), and a more sophisticated one which uses the Gini impurity measure [14] in order to add both high and low pseudo-ratings, thus increasing the intrinsic information of the item profiles (FFAg). The Gini coefficient is defined as:

    Gini(v) = 1 − Σ_{x ∈ v} p_x²    (10)

where p_x is the probability of x in v. In our specific case v is an item profile (i.e., a column of the URM) and p_x = n_x / n, where n_x is the number of ratings equal to x in v and n is the number of ratings in v. When Gini(v) = 0, v is pure (and brings almost no information). As we want to add informative pseudo-ratings, the filter lets through only the pseudo-ratings that increase the Gini index of each item the most. This is done until at least g pseudo-ratings are added to each item profile. The value of g depends on the number of original ratings in each profile (denoted by n):

    g = n_min − n + (h · n)    if n ≤ n_min
    g = (h · n_min²) / n       otherwise    (11)

where n_min and h are parameters. In our experiments we used the average number of ratings as the value for n_min and 0.3 for h.
Rating prediction has been computed by using (1), where r_ui are the original user ratings and s^CF_ij is the similarity between items i and j computed using (2) on the augmented URM.

4.2 Similarity Injection Knn (simInjKnn)
Similarity Injection Knn builds a model using item-item similarities obtained by one collaborative and one content-based algorithm.
We first compute item-item similarities s^CF_ij using collaborative filtering, retaining only the k most similar items. Similarly, we compute item-item similarities s^CBF_ij using content-based filtering. The similarities are later merged into a unique item-item similarity matrix S, by adopting a two-step process:

- all the elements s^CF_ij are copied into S. Thus, each item is given k similarity values deriving from the collaborative filtering;

- the elements s^CBF_ij are later inserted into the corresponding empty (i.e., zero) elements of S, in such a way to have, for each item, a set of k additional similarity measures deriving from the content-based filtering.

At the end of the construction, each item will have a total of 2k similar items:

- the k most collaborative-based similar items;

- the k most content-based similar items not appearing in the previous set.

Exactly half the similarities come from the content-based technique and half from the collaborative technique.
The framework allows different combinations of item-based collaborative and content-based filtering. In this work we present the results based on LSA as the content-based algorithm and NNCosNgbr as the collaborative filtering, since they reached the highest quality in terms of recall. Thus, s^CF_ij and s^CBF_ij have been computed using (2) and (7), respectively. Finally, rating prediction is estimated with (1), where s^CF_ij is replaced by s_ij.
5. DATASET AND EVALUATION METHODOLOGY
We evaluated our algorithms using a variant of the methodology described in [22, 11], designed to evaluate the performance of recommender algorithms with a focus on the new-item problem.
We used the dataset provided by an IPTV provider (the ratings refer to the dataset TV2, available at http://home.dei.polimi.it/cremones/memo/). It contains 25765 binary ratings implicitly collected over a period of six months, given by 15563 users on 794 items (0.0021 sparsity). Content-based features comprise: actors, directors, category (e.g., movie, series, documentary...), title, genres, summary, and language, for a total of 11291 features and an average of 18 features per item. The main characteristics of the available features are summarized in Table 1.

Table 1: Dataset characteristics: number of different values for each type of feature (a) and number of items for each language (b).

(a) Types of features        (b) Languages
Type        #values          Language   #items
actors      693              German     703
categories  8                French     84
directors   88               Italian    7
genres      16
languages   3

The evaluation of recommender systems is typically performed by using either error metrics or accuracy metrics [18]. Since error metrics - such as MAE (mean absolute error) and RMSE (root mean square error) - rely on computing the error between actual and predicted ratings, they cannot be measured on implicit, binary datasets, where this information is not available [20, 15].
For such reason we focus on accuracy metrics, which estimate the fraction of relevant items that are actually recommended (recall) or the fraction of the recommended items that are actually relevant (precision). In addition, recent works [11, 27] consider accuracy metrics more suitable than error metrics for evaluating the top-N recommendation task [18], i.e., the capability of the recommender system to suggest very limited lists of items that are likely to be of interest for the users.
The standard definition of recall - which is typically used in the settings of information retrieval - is:

    recall = |relevant ∧ retrieved| / |relevant|    (12)

Usually, in the settings of recommender systems, the set of relevant items is composed of the items positively rated (e.g., if the rating is 5 out of 5). However, since we are dealing with an implicit, binary dataset, in our evaluation we have considered - analogously to other works such as [15, 12, 13] - all rated items to be relevant, as we do not have any further information about the degree of user satisfaction.
The original approach evaluates the quality of recommender algorithms by measuring the recall as a function of the number of items displayed to the user (N). The test consists in excluding a certain amount of ratings (test set) and using the remaining ratings to train the algorithms (training set). All available content-based features are used for training the content-based and hybrid algorithms. Once an algorithm has been trained with the ratings in the training set, each rating r_ui in the test set is tested as follows:

- we predict the score for all items unrated by user u;

- we select the top-N items according to the estimated score;

- if item i appears in the top-N recommendation list, we have a 'hit', i.e., the system has correctly suggested a relevant item.

With respect to (12), the set of relevant items corresponds to the test set, while the set of relevant, retrieved items corresponds to the number of hits. Thus, recall can be rewritten as a function of N, i.e., the number of items displayed to users:

    recall(N) = #hits / |test set|    (13)

Because of the high dataset sparsity, we have formed the test set by randomly selecting 20% of the ratings, in order to have a significant number of samples.
This evaluation methodology is not able to measure the quality of the recommendations on the new-item problem. Therefore we have implemented some modifications.

5.1 Performance on new items
In order to specifically evaluate the impact of new items on the quality of the different algorithms, we developed a testing methodology, which we refer to as sliding window, an extension of the evaluation methodology presented in [22, 11].
Let M denote the number of items in the dataset. We randomly divide the items in two sets, H1 and H2, each one composed of M/2 items. Test set and training set are defined as a function of the percentage parameter β as follows:

- we define a set T by randomly selecting 20% of the ratings in the URM;

- the training set is defined as the set of ratings related to items in H1, excluding all ratings in T;

- we form a set H1+2 composed of M/2 items, (100 − β)% randomly extracted from set H1 and β% randomly extracted from set H2;

- the test set is composed of the ratings in T related to items in H1+2.

Figure 2 schematically shows how the different sets are formed.

Figure 2: Sliding window. Blank circles represent ratings in the set T, solid circles represent ratings in the test set, and x's represent ratings in the training set.

For each value of β we have composed a training and a test set and we have computed the quality of the recommender algorithm in terms of recall, as defined in (13), where N has been set equal to 20. The test is the same as the one described for the original evaluation methodology. For each rating r_ui in the test set: (i) we predict the score for all items unrated by user u, (ii) we select the top-20 items according to the estimated score, and (iii) we verify whether there has been a hit, i.e., if item i appears in the top-20.
Note that only ratings related to half the items are available for training the algorithms. The parameter β specifies the percentage of new items, i.e., the percentage of test items that do not have ratings in the training set and so cannot be recommended by standard collaborative filtering. In our experiments we varied β from 0% to 100%.
Collaborative filtering is expected to be able to recommend only items in H1, so the higher β, the lower the quality of the algorithm. When β = 100% we expect the quality of any collaborative filtering to be 0. On the other hand, content-based algorithms are trained exclusively with content-based features, thus resulting totally independent from the ratings included in the training set: we expect their quality not to be influenced by β. Finally, hybrid approaches can be trained with the ratings related to half the items and with all available content-based features.
Finally, hybrid approaches can be trained with the ratings related to half the items and with all available content-based features.

6. RESULTS AND DISCUSSION
In our tests we considered the state-of-the-art recommender algorithms described in Section 3 and the hybrid approaches proposed in Section 4. As for the former, we included two item-based collaborative algorithms – PureSVD and non-normalized cosine (NNCosNgbr) – a content-based algorithm – Latent Semantic Analysis (LSA) – and two hybrid algorithms – interleaved and simComb. As for the latter, we included simInjKnn and two variants of filtered feature augmentation, referred to as FFAg and FFAt. FFAg uses a filter based on Gini's impurity measure, while FFAt uses a filter based on a fixed threshold, set to 1, thus excluding all pseudo-ratings lower than 1 (the number of pseudo-ratings added to the URM was 27.5% of the original number of ratings).

The latent size l of LSA has been set to 300, while the number of features f in PureSVD is equal to 50. The neighborhood size k has been set to 200 for NNCosNgbr. As for simComb, the coefficient α has been set to 0.3, as it empirically provides the best results. Finally, the neighborhood size k used for simInjKnn is 100.

Figure 3 shows the sliding window results, plotting the recall of the state-of-the-art (a) and the proposed hybrid (b) algorithms as a function of β, i.e., the percentage of new items. The recall has been computed assuming N = 20.

Figure 3: Results of the sliding window test. Figure (a) refers to the state-of-the-art approaches (interleaved, LSA300, NNCosNgbr, pureSVD, simComb), while Figure (b) to the proposed hybrid solutions (FFAg, FFAt, simInjKnn); both plot Recall(20) against the percentage of new items (β).

The most perceptible result is the very poor quality of the content-based algorithm, whose recall does not exceed 3%, much lower than the best collaborative approach - NNCosNgbr - whose recall is about 25% when recommending only old items. In addition, we can observe that three hybrid algorithms - simComb, FFAt, and simInjKnn - have a recall higher than the collaborative and content-based state-of-the-art solutions. As for the other two hybrid algorithms, the interleaved approach confirms its poor quality, being based on a very trivial technique, while the lower quality of FFAg is explained by the fact that binary ratings do not carry enough information for the filter to work properly. However, the results of the sliding window test show that FFAt and simInjKnn outperform the state-of-the-art algorithm simComb when more than 40% of the items are new. In addition, we can notice that recall drops drastically when β = 100%, with the exception of the content-based algorithm, which is not influenced by the presence of unrated items.

As a final observation, let us consider the performance implications of the proposed hybrid solutions. Similarity Injection Knn (simInjKnn) requires building two item-to-item similarity matrices, which is the same effort required by the interleaved algorithm. Filtered Feature Augmentation (FFAg and FFAt), on the other hand, requires computing pseudo-ratings for each item; depending on the number of items and users, this can be computationally costly. The main benefit of Filtered Feature Augmentation is that it is completely independent from the collaborative and content-based algorithms used, as opposed to Similarity Injection Knn, which strictly requires algorithms able to express item-item similarities.

7. CONCLUSIONS AND FUTURE WORKS
The new hybrid algorithms we proposed have been proved to improve the overall performance of the state-of-the-art algorithms in suggesting new items, though their performance is limited by the poor quality of content-based algorithms.

In future work we plan to implement between-subjects controlled experiments for a subjective evaluation of the proposed solutions. In fact, recent works have pointed out that the objective evaluation of recommender systems (i.e., based on error and accuracy metrics) is not always aligned with the actual user perception of recommendation quality, as measured via controlled experiments (e.g., [9, 27]).

8. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749, 2005.
[2] R. Bambini, P. Cremonesi, and R. Turrin. A recommender system for an IPTV service provider: a real large-scale production environment. In Recommender Systems Handbook. Springer, 2010. To appear.
[3] R. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize. 2007.
[4] R. M. Bell and Y. Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 43–52, Oct. 2007.
[5] D. Billsus and M. J. Pazzani. User modeling for adaptive news access. User Modeling and User-Adapted Interaction, 10:147–180, February 2000.
[6] R. Burke. Hybrid recommender systems: survey and experiments. User Modeling and User-Adapted Interaction, 12:331–370, November 2002. Kluwer Academic Publishers, Hingham, MA, USA.
[7] R. Burke. Hybrid web recommender systems. In The Adaptive Web, pages 377–408. Springer-Verlag, Berlin, Heidelberg, 2007.
[8] L. Candillier, K. Jack, F. Fessant, and F. Meyer. State-of-the-art recommender systems. IGI Global, pages 1–22, 2009.
[9] L. Chen and P. Pu. A user-centric evaluation framework of recommender systems. In ACM Conference on Recommender Systems (RecSys '10), Workshop on User-Centric Evaluation of Recommender Systems and Their Interfaces (UCERSTI '10), pages 14–21, Barcelona, Spain, Sept. 2010.
[10] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems, August 1999.
[11] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys, pages 39–46, 2010.
[12] P. Cremonesi and R. Turrin. Analysis of cold-start recommendations in IPTV systems. In RecSys '09: Proceedings of the 2009 ACM Conference on Recommender Systems, pages 1–4. ACM, 2009.
[13] P. Cremonesi and R. Turrin. Time-evolution of IPTV recommender systems. In Proc. of the 8th European Conference on Interactive TV and Video, Tampere, Finland, June 2010. ACM.
[14] B. de Ville. Decision Trees for Business Intelligence and Data Mining: Using SAS Enterprise Miner. SAS Publishing, 2006.
[15] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177, 2004.
[16] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and K. E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. Pages 465–480, New York, NY, USA, 1988. ACM Press.
[17] M. A. Ghazanfar and A. Prugel-Bennett. A scalable, accurate hybrid recommender system. In Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, WKDD '10, pages 94–98, Washington, DC, USA, 2010. IEEE Computer Society.
[18] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53, 2004.
[19] P. Husbands, H. Simon, and C. Ding. On the use of singular value decomposition for text retrieval. Oct. 2000.
[20] G. Karypis. Evaluation of item-based top-n recommendation algorithms. In Proceedings of the Tenth International Conference on Information and Knowledge Management, CIKM '01, pages 247–254, New York, NY, USA, 2001. ACM.
[21] J. Kittler, M. Hatef, R. Duin, and J. Matas. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(3):226–239, Mar. 1998.
[22] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434, New York, NY, USA, 2008. ACM.
[23] P. Lops, M. Gemmis, and G. Semeraro. Content-based recommender systems: state of the art and trends. In F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors, Recommender Systems Handbook, chapter 3, pages 73–105. Springer US, Boston, MA, 2011.
[24] P. Melville, R. J. Mooney, and R. Nagarajan. Content-boosted collaborative filtering for improved recommendations. In Eighteenth National Conference on Artificial Intelligence, pages 187–192, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence.
[25] B. Mobasher, X. Jin, and Y. Zhou. Semantically enhanced collaborative filtering on the web. In Proceedings of the 1st European Web Mining Forum (EWMF 2003), pages 57–76, Sept. 2003.
[26] R. Navarro-Prieto, P. Rebaque-Rivas, and J. Hernández-Pablo. Recommending content for iTV: what the users really want? In Proceedings of the 8th International Interactive Conference on Interactive TV&Video, EuroITV '10, pages 123–126, New York, NY, USA, 2010. ACM.
[27] P. Cremonesi, F. Garzotto, S. Negro, A. Papadopoulos, and R. Turrin. Comparative evaluation of recommender system quality. In CHI Extended Abstracts on Human Factors in Computing Systems. ACM, 2011. To appear.
[28] M. J. Pazzani. A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13:393–408, December 1999.
[29] G. Salton, editor. Automatic Text Processing. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1988.
[30] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. 10th Int. Conf. on World Wide Web, pages 285–295, 2001.
[31] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Incremental singular value decomposition algorithms for highly scalable recommender systems. 5th Int. Conf. on Computer and Information Technology (ICCIT 2002), pages 399–404, 2002.
[32] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Generative models for cold-start recommendations. In ACM SIGIR Workshop on Recommender Systems, 2001.
[33] A. Schein, A. Popescul, L. Ungar, and D. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), pages 253–260, 2002.
[34] B. Smyth and P. Cotter. A personalised TV listings service for the digital TV age. Knowledge-Based Systems, 13(2-3):53–59, 2000.
[35] H. Zhang, S. Zheng, and J. Yuan. A personalized TV guide system compliant with MHP. Consumer Electronics, IEEE Transactions on, 51(2):731–737, May 2005.
[36] J. Zimmerman, K. Kurapati, A. L. Buczak, D. Schaffer, S. Gutta, and J. Martino. TV personalization system: design of a TV show recommender engine and interface (chapter 5).