Modeling Difficulty in Recommender Systems

Benjamin Kille, Sahin Albayrak
DAI-Lab, Technische Universität Berlin
{kille,sahin}@dai-lab.de

ABSTRACT
Recommender systems have frequently been evaluated with respect to their average performance for all users. However, optimizing recommender systems with regard to such evaluation measures might provide worse results for a subset of users. Defining a difficulty measure allows us to evaluate and optimize recommender systems in a personalized fashion. We introduce an experimental setup to evaluate the eligibility of such a difficulty score. We formulate the hypothesis that, provided a difficulty score, recommender systems can be optimized regarding costs and performance simultaneously.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, Retrieval models, Search process, Selection process; H.3.4 [Information Technology and Systems Applications]: Decision support

General Terms
Algorithms, Design, Experimentation, Measurement, Human factors

Keywords
difficulty, recommender systems, user modeling, evaluation

1. INTRODUCTION
Evaluating a recommender system's performance represents a non-trivial task. The choice of evaluation measure and methodology depends on various factors. Modeling the recommendation task as a rating prediction problem favors measures such as root mean squared error (RMSE) and mean absolute error (MAE) [7]. In contrast, mean average precision (MAP) and normalized discounted cumulative gain (NDCG) qualify as evaluation measures for an item ranking scenario [12]. In either case, two recommendation algorithms are compared with respect to their average performance on the full set of users. This entails that all users are treated equally. However, we argue that the difficulty of recommending items to users varies. Suppose we consider a recommender system with two users, Alice and Bob. Alice has rated a large number of items. Bob has recently started using the system and has rated only a few items. The system recommends a number of items to both of them, and Alice and Bob rate the recommended items. Suppose we attempt to evaluate two recommender algorithms, denoted R1 and R2, based on those ratings. Assume that R1 predicts Alice's ratings with an error of 0.8 and Bob's ratings with an error of 0.9. On the other hand, we observe R2 to deviate by 1.0 for Alice and 0.8 for Bob, respectively. Averaging the errors, we obtain 0.85 for R1 and 0.9 for R2. Still, R2 predicts Bob's ratings better, even though his preferences exhibit higher sparsity compared to Alice's. Besides the number of ratings, there are further factors that determine how well a recommendation algorithm performs for a given user.
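The following snippet merely restates this averaging effect with the hypothetical error values from the example above; the user and algorithm names are illustrative and not tied to any particular evaluation framework.

```python
# Hypothetical per-user errors (e.g., RMSE) taken from the Alice/Bob example.
errors = {
    "R1": {"Alice": 0.8, "Bob": 0.9},
    "R2": {"Alice": 1.0, "Bob": 0.8},
}

# Macro-averaging over users favors R1 (0.85 vs. 0.90) ...
for algo, per_user in errors.items():
    average = sum(per_user.values()) / len(per_user)
    print(f"{algo}: average error = {average:.2f}")

# ... yet R2 serves the sparse user Bob better than R1 does.
best_for_bob = min(errors, key=lambda algo: errors[algo]["Bob"])
print(f"Lowest error for Bob: {best_for_bob}")
```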
We introduce the notion of difficulty in the context of recommender systems. Each user is assigned a difficulty value reflecting her expected evaluation outcome. Users with high errors or disordered item rankings receive a high difficulty value. Conversely, we assign low difficulty values to users exhibiting low errors and well-ordered item rankings. Recommender systems benefit from those difficulty values twofold. First, optimization can target the difficult users who require such effort, while users with low difficulty values can be provided with recommendations generated by more trivial methods; recommending the most popular items represents such an approach. Second, difficulty values enable the system to estimate how likely a specific user will perceive the recommendations as adequate. Thereby, the system can control interactions with users, e.g. by asking for additional ratings to provide better recommendations for particularly difficult users.

2. RELATED WORK
Adomavicius and Tuzhilin reveal possible extensions to recommender systems [1]. They mention an enhanced understanding of the user as one such extension. Our work attempts to contribute to this by defining a user's difficulty. In this way, the system estimates a user's likely perception of the recommendations. The methods performing best on the well-known Netflix Prize challenge applied ensemble techniques to further improve their rating prediction accuracy [3, 11]. Neither states an explicit model of recommendation difficulty on the user level. However, ensemble techniques involve an implicit presence of such a concept: combining the outcomes of several recommendation algorithms makes sense in case a single recommendation algorithm fails for a subset of users. Such an effect is consequently compensated by including other recommendation algorithms' outcomes. Bellogin presents an approach to predict a recommender system's performance [4]. However, the evaluation is reported on the system level, averaging evaluation measures over the full set of users. Our approach focuses on predicting the user-level difficulty. Koren and Sill introduced a recommendation algorithm that outputs confidence values [8]. Those confidence values could be regarded as difficulty. However, the confidence values are algorithm specific, whereas we require the difficulty score's validity to hold in general. The concept of evaluation considering difficulty has been introduced in other domains. Aslam and Pavlu investigated the estimation of a query's difficulty in the context of information retrieval [2]. Their approach is based on the diversity of the document lists retrieved by several IR systems. Strong agreement with respect to the document ranking indicates a low level of difficulty; in contrast, highly diverse rankings suggest a high level of difficulty. He et al. describe another approach to estimate a query's difficulty [6]. Their method compares the content of retrieved documents and computes a coherence score. They assume that a query yielding highly coherent documents is clear and thus less difficult than a query resulting in minor coherency. Genetic algorithms represent another domain where difficulty has received the research community's attention; there, however, the difficulty is determined by the fitness landscape of the respective problem [5, 6]. Kuncheva and Whitaker investigated diversity in the context of a classification scenario [9]. Concerning a query's difficulty, diversity has been found to be an appropriate measure [2]. We will adapt two diversity measures applied in the classification scenario to determine the difficulty of recommending items to a user.

3. METHOD
We strive to determine the difficulty of the recommendation task for a given user. Formally, we are looking for a map δ : U → [0, 1] that assigns each user u ∈ U a real number between 0 and 1 corresponding to the level of difficulty. For a single recommendation algorithm, the difficulty for a given user could simply be determined by a (normalized) evaluation measure, e.g. RMSE. However, such a difficulty would not be valid for other recommendation algorithms. In addition, recommender systems optimized with respect to the item ranking would likely end up with different difficulty values. We adapt the idea of measuring difficulty in terms of diversity to overcome those issues. We assume the recommender system comprises several recommendation methods, denoted A = {A_1, A_2, ..., A_n}. Each A_n generates rating predictions along with item rankings for a given user u. We choose RMSE to evaluate the rating predictions and NDCG to assess the item rankings, respectively. We measure a user's difficulty by means of the diversity of those rating predictions and item rankings. For that purpose, we adjusted two diversity metrics introduced by Kuncheva and Whitaker for classification ensembles [9]. We alter the pair-wise Q statistic to fit the item ranking scenario. Regarding the rating prediction, we adapt the difficulty measure θ.

3.1 Q statistics
Applied to the classification ensemble setting, the Q statistic is based on a confusion matrix. The confusion matrix confronts two classifiers in a binary manner - correctly classified versus incorrectly classified. We need to adjust this setting to fit the item ranking scenario. Table 1 illustrates the adjusted confusion matrix. We count all correctly ranked as well as all incorrectly ranked items for both recommendation algorithms. Subsequently, the confusion matrix displays the overlap of those item sets. Equation 1 represents the Q statistic derived from the confusion matrix. Note that the Q statistic measures diversity between two recommender algorithms. Hence, we need to average the Q statistic values obtained for all comparisons between the available algorithms. Equation 2 computes the final diversity measure.

                            A_i: rank(k) ≤ rank(l)    A_i: rank(k) > rank(l)
  A_j: rank(k) ≤ rank(l)            N^11                      N^10
  A_j: rank(k) > rank(l)            N^01                      N^00

Table 1: Adjusted confusion matrix, where rank(k) ≤ rank(l) is true.

$$Q_{ij}(u) = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{10} N^{01}} \qquad (1)$$

$$Q_A(u) = \binom{|A|}{2}^{-1} \sum_{A_i, A_j \in A} Q_{ij}(u) \qquad (2)$$
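Below is a minimal sketch of how the adjusted Q statistic could be computed. It assumes that, for each algorithm, the set of held-out item pairs it ranks consistently with the ground-truth order has already been determined; the function names, the data layout, and the handling of a zero denominator are illustrative assumptions rather than details given in the text.

```python
from itertools import combinations

def pairwise_q(correct_i: set, correct_j: set, all_pairs: set) -> float:
    """Q statistic (Eq. 1) for two algorithms A_i, A_j on one user's held-out item pairs."""
    n11 = len(correct_i & correct_j)              # both algorithms rank the pair correctly
    n00 = len(all_pairs - correct_i - correct_j)  # both rank it incorrectly
    n10 = len(correct_j - correct_i)              # only A_j correct (N^10 in Table 1)
    n01 = len(correct_i - correct_j)              # only A_i correct (N^01 in Table 1)
    denominator = n11 * n00 + n10 * n01
    # Degenerate case not discussed in the text; treat it as "no measurable diversity".
    return 0.0 if denominator == 0 else (n11 * n00 - n01 * n10) / denominator

def q_a(correct_by_algo: dict, all_pairs: set) -> float:
    """Average Q statistic over all algorithm pairs (Eq. 2); needs at least two algorithms."""
    algo_pairs = list(combinations(correct_by_algo, 2))
    return sum(
        pairwise_q(correct_by_algo[a], correct_by_algo[b], all_pairs)
        for a, b in algo_pairs
    ) / len(algo_pairs)
```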
3.2 Difficulty measure θ
Kuncheva and Whitaker introduce the difficulty measure θ as a non-pairwise measure [9]. θ allows us to consider several recommendation algorithms simultaneously. We iterate over the set of withheld items for the given user. At each step, we compute the RMSE for all recommendation algorithms at hand. Thereby, we observe the variance between all recommender algorithms' RMSE values. The average variance displays how diversely the set of recommendation algorithms perceives the user. Equation 3 calculates θ. I denotes the set of items in the target user's test set. The function σ(i) computes the variance of the recommendation algorithms' RMSE values for the given item.

$$\theta(u) = \frac{1}{|I|} \sum_{i \in I} \sigma(i) \qquad (3)$$

3.3 Difficulty measure δ
We attempt to formulate a robust difficulty measure for recommender systems. Therefore, we propose a combination of the Q statistic and θ. As a result, both RMSE and NDCG are considered. The more recommendation algorithms we include, the more robust the final difficulty value will be. Equation 4 displays a linear combination of both measures. The parameter λ ∈ [0, 1] controls the weighting. λ = 1 allows us to focus entirely on the ranking task; the absence of rating values might require such a setting.

$$\delta(u \mid A) = \lambda \cdot Q_A(u) + (1 - \lambda) \cdot \theta(u) \qquad (4)$$
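The following sketch mirrors Equations 3 and 4 under the assumption that each algorithm's per-item prediction errors (the per-item analogue of RMSE) are already available for the target user; the data layout and function names are illustrative, not prescribed above.

```python
from statistics import pvariance

def theta(errors_by_algo: dict) -> float:
    """Difficulty measure theta (Eq. 3): average over the user's test items of the
    variance of the algorithms' prediction errors on each item.

    errors_by_algo maps an algorithm name to a dict of {item_id: prediction_error}.
    Assumes a non-empty test set shared by all algorithms.
    """
    items = set.intersection(*(set(errors) for errors in errors_by_algo.values()))
    return sum(
        pvariance([errors_by_algo[algo][item] for algo in errors_by_algo])
        for item in items
    ) / len(items)

def delta(q_a_value: float, theta_value: float, lam: float = 0.5) -> float:
    """Combined difficulty score delta (Eq. 4); lam = 1 focuses on the ranking task only."""
    return lam * q_a_value + (1.0 - lam) * theta_value
```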
4. EXPERIMENTAL SETTING
We will evaluate the proposed difficulty measure δ; in the following, we outline the intended experimental protocol. Several data sets, such as Movielens 10M, Netflix, and Last.fm, provide the required data. We will implement item-based and user-based neighborhood recommenders and matrix factorization recommenders, along with slope-one recommenders. None of those recommendation methods requires data other than user preferences. As a first step, we split each data set into three partitions. The first partition will be used for training, and the second partition to assign each user a difficulty score according to Equation 4. Finally, the third partition will be used as the evaluation data set. The subset of users with comparably low difficulty scores will receive recommendations based on a non-optimized recommendation method, such as the most popular recommender. The subset of users with medium difficulty scores will receive recommendations generated by slightly optimized recommender algorithms. Finally, the particularly hard users will receive recommendations generated by highly optimized recommenders. We will compare both the required effort and the recommendation quality against approaches that treat all users in the same way. Such approaches include optimizing recommenders to perform well on average over the full set of users, as well as serving all users with trivial recommenders, for instance most popular recommendations. Our hypothesis is that the difficulty score will allow us to achieve lower costs along with comparably high recommendation quality by selecting an appropriate recommendation algorithm for each user. In addition, we will observe which user characteristics determine a user's difficulty.
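To make the protocol concrete, here is a rough sketch of the intended split-and-route procedure; the 60/20/20 partitioning ratio, the difficulty thresholds, and the tier labels are assumptions for illustration and are not specified above.

```python
import random

def three_way_split(interactions: list, seed: int = 42):
    """Split a user-preference data set into training, difficulty-scoring,
    and evaluation partitions (ratios are an assumed 60/20/20)."""
    rng = random.Random(seed)
    shuffled = interactions[:]
    rng.shuffle(shuffled)
    first_cut, second_cut = int(0.6 * len(shuffled)), int(0.8 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]

def route_user(difficulty: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a user's difficulty score (Eq. 4) to a recommender tier; thresholds are assumed."""
    if difficulty < low:
        return "most-popular recommender (non-optimized)"
    if difficulty < high:
        return "slightly optimized recommender"
    return "highly optimized recommender"
```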
5. CONCLUSION AND FUTURE WORK
We introduced the notion of difficulty in the context of recommending items to users. Measuring that task's difficulty on the user level allows a more personalized optimization of recommender systems. Users with comparably low difficulty scores receive adequate recommendations without much effort. On the other hand, users with a comparably high difficulty score may be asked to provide additional data to improve the system's perception of them. A combination of the Q statistic and the difficulty measure θ - both adjusted to fit the recommendation scenario - allows us to measure the difficulty for a given user. The calculation requires a sufficiently large data set containing user preferences on items, along with a set of implemented recommendation algorithms. In Section 4, we introduced an experimental setting that we will use to evaluate the presented methodology. In addition, we intend to apply pattern recognition techniques to find correlations between the difficulty score and user characteristics. For instance, the number of ratings is known to correlate with the difficulty for new users ("cold-start problem") and for users with a large number of rated items ("power users") [10].

6. ACKNOWLEDGEMENTS
This work is funded by the DFG (Deutsche Forschungsgemeinschaft) in the scope of the LSR project.

7. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734-749, 2005.
[2] J. A. Aslam and V. Pavlu. Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions. In Proceedings of the 29th European Conference on IR Research, ECIR '07, pages 198-209, 2007.
[3] R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. SIGKDD Explorations Newsletter, 9(2):75-79, 2007.
[4] A. Bellogin. Predicting performance in recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 371-374. ACM, 2011.
[5] M. Hauschild and M. Pelikan. Advanced neighborhoods and problem difficulty measures. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pages 625-632, 2011.
[6] J. He, M. Larson, and M. de Rijke. Using coherence-based measures to predict query difficulty. In Proceedings of the 30th European Conference on IR Research, ECIR '08, pages 689-694, 2008.
[7] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5-53, 2004.
[8] Y. Koren and J. Sill. OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 117-124, New York, NY, USA, 2011. ACM.
[9] L. Kuncheva and C. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51:181-207, 2003.
[10] A. Said, B. Jain, and S. Albayrak. Analyzing weighting schemes in collaborative filtering: Cold start, post cold start and power users. In 27th ACM Symposium on Applied Computing (SAC '12), New York, NY, USA, 2012. ACM.
[11] G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. Major components of the Gravity recommendation system. SIGKDD Explorations, 9(2), 2007.
[12] S. Vargas and P. Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 109-116, New York, NY, USA, 2011. ACM.