<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Published by CEUR-WS.org</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Novel Multidimensional Framework for Evaluating Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artus Krohn-Grimberghe</string-name>
          <email>artus@ismll.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandros Nanopoulos</string-name>
          <email>nanopoulos@ismll.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lars Schmidt-Thieme</string-name>
          <email>schmidt-thieme@ismll.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Systems and Machine Learning Lab, University of Hildesheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <volume>612</volume>
      <fpage>34</fpage>
      <lpage>41</lpage>
      <abstract>
        <p>The popularity of recommender systems has led to a large variety of their applications. This, however, makes their evaluation a challenging problem, because different and often contrasting criteria are established, such as accuracy, robustness, and scalability. In related research, usually only condensed numeric scores such as RMSE, AUC, or F-measure are used for the evaluation of an algorithm on a given data set. It is obvious that these scores are insufficient to measure user satisfaction. Focusing on the requirements of business and research users, this work proposes a novel, extensible framework for the evaluation of recommender systems. In order to ease user-driven analysis we have chosen a multidimensional approach. The research framework advocates interactive visual analysis, which allows easy refining and reshaping of queries. Integrated actions such as drill-down or slice/dice enable the user to assess the performance of recommendations in terms of business criteria such as increase in revenue, accuracy, prediction error, coverage, and more. The ability of the proposed framework to comprise an effective way for evaluating recommender systems in a business-user-centric way is shown by experimental results using a research prototype.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The popularity of recommender systems has resulted in a
large variety of their applications, ranging from presenting
personalized web-search results over identifying preferred
multimedia content (movies, songs) to discovering friends
in social networking sites. This broad range of applications,
however, makes the evaluation of recommender systems a
challenging problem. The reason is the different and often
contrasting criteria involved in real-world
applications of recommender systems, such as their accuracy,
robustness, and scalability.</p>
      <p>The vast majority of related research usually evaluates
recommender system algorithms with condensed numeric
scores: root mean square error (RMSE) or mean absolute
error (MAE) for rating prediction, or measures usually
stemming from information retrieval such as precision/recall or
F-measure for item prediction. Evidently, although such
measures can indicate the performance of algorithms
regarding some perspectives of recommender systems' applications,
they are insufficient to cover the whole spectrum of aspects
involved in most real-world applications. As an alternative
approach towards characterizing user experience as a whole,
several studies employ user-based evaluations. These
studies, though, are usually rather costly and difficult to design
and implement.</p>
      <p>More importantly, when recommender systems are
deployed in real-world applications, notably e-commerce, their
evaluation should be done by business analysts and not
necessarily by recommender-system researchers. Thus, the
evaluation should be flexible in testing recommender algorithms
according to business analysts' needs using interactive queries
and parameters. What is, therefore, required is to
provide support for evaluation of recommender systems'
performance based on popular online analytical processing (OLAP)
operations. Combined with support for visual analysis,
actions such as drill-down or slice/dice, allow assessment of the
performance of recommendations in terms of business
objectives. For instance, business analysts may want to examine
various performance measures at different levels (e.g.,
hierarchies in categories of recommended products), detect trends
in time (e.g., elevation of average product rating following a
change in the user interface), or segment the customers and
identify the recommendation quality with respect to each
customer group. Furthermore, the interactive and visual
nature of this process allows easy adaptation of the queries
according to insights already gained.</p>
      <p>In this paper, we propose a novel approach to the
evaluation of recommender systems. Based on the
aforementioned motivation factors, the proposed methodology builds
on multidimensional analysis, allowing the consideration of
various aspects important for judging the quality of a
recommender system in terms of real-world applications. We
describe a way for designing and developing the proposed
extensible multidimensional framework, and provide insights
into its applications. This enables integration, combination,
and comparison of both the presented and additional
measures (metrics).</p>
      <p>To assess the benefits of the proposed framework, we have
implemented a research prototype and now present
experimental results that demonstrate its effectiveness.</p>
      <p>Our main contributions are summarized as follows:</p>
      <p>A flexible multidimensional framework for evaluating
recommender systems.</p>
      <p>A comprehensive procedure for efficient development
of the framework in order to support analysis of both
dataset facets and algorithms' performance using
interactive OLAP queries (e.g., drill-down, slice, dice).</p>
      <p>The consideration of an extended set of evaluation
measures, compared to standards such as the RMSE.</p>
      <p>Experimental results with intuitive outcomes based on
swift visual analysis.</p>
      <p>Copyright © 2010 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors: Knijnenburg, B.P., Schmidt-Thieme, L., Bollen, D.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        For general analysis of recommender systems, Breese [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and Herlocker et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] provide a comprehensive overview
of evaluation measures with the aim of establishing
comparability between recommender algorithms. Nowadays, the
generally employed measures within the prevailing
recommender tasks are MAE, (R)MSE, precision, recall, and
F-measure. In addition, further measures including confidence,
coverage, and diversity-related measures are discussed but
are not yet broadly used. Especially the latter two have
attracted attention over the last years as it is still not certain
whether today's predictive accuracy or precision and recall
related measures correlate directly with interestingness for
a system's end users. As such various authors proposed and
argued for new evaluation measures [
        <xref ref-type="bibr" rid="ref21 ref22 ref6">22, 21, 6</xref>
        ]. Ziegler [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
has analyzed the effect of diversity with respect to user
satisfaction and introduced topic diversification and intra-list
similarity as concepts for the recommender system
community. Zhang and Hurley [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] have improved the intra-list
similarity and suggested several solution strategies to the
diversity problem. Celma and Herrera [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have addressed the
closely related novelty problem and proposed several
technical measures for coverage and similarity of item
recommendation lists. All these important contributions focus on
reporting single aggregate numbers per dataset and algorithm.
While our framework can deliver those, too, it goes beyond
that by its capability of combining the available measures
and, most importantly, dissecting them among one or more
dimensions.
      </p>
      <p>
        Analysis of the end users' response to recommendations
and their responses' correlation with the error measures used
in research belongs to the field of Human-Recommender
Interaction. It is best explored by user studies and large scale
experiments, but both are very expensive to obtain and thus
rarely conducted and rather small in scale. Select studies are
[
        <xref ref-type="bibr" rid="ref13 ref14 ref4">13, 14, 4</xref>
        ]. Though in the context of classical information
retrieval, Joachims et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] have conducted a highly
relevant study on the biasing effect of the position an item has
within a ranked list. In the context of implicit feedback vs.
explicit feedback, Jones et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have conducted an
important experiment on the preferences of users concerning
recommendations generated by unobtrusively collected implicit
feedback compared to recommendations based on explicitly
stated preferences. Bollen et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have researched the
effect of recommendation list length in combination with
recommendation quality on perceived choice satisfaction. They
found that for high quality recommendations, longer lists
tend to overburden the user with difficult choice decisions.
Against the background of those results, we believe that for
initial research on a dataset, forming an idea, or checking if
certain effects are present, working on collected data with
a framework like the one presented is an acceptable proxy.
With findings gained in this process, conducting meaningful
user studies is an obvious next step.
      </p>
      <p>
        Recent interesting findings with respect to dataset
characteristics are, e.g., the results obtained during the Netflix
challenge [
        <xref ref-type="bibr" rid="ref17 ref3">3, 17</xref>
        ] on user and item base-effects and time-effects in
the data. When modeled appropriately, these have a noteworthy
effect on recommender performance. The long time it took
to observe these properties of the dataset might be an
indicator that, with currently available tools, proper
analysis of the data at hand is more difficult and tedious
than it should be. This motivates the creation of
easy-to-use tools enabling thorough analysis of the datasets and the
recommender algorithms' results, presenting results in
an easy-to-consume way for the respective analysts.
      </p>
      <p>
        Notable work regarding the integration of OLAP and
recommender systems stems from the research of Adomavicius
et al. [
        <xref ref-type="bibr" rid="ref1 ref2">2, 1</xref>
        ]. They treat the recommender problem setting
with its common dimensions of users, items, and rating as
inherently multidimensional. But unlike this work, they focus
on the multidimensionality of the generation of
recommendations and on the recommenders themselves being
multidimensional entities that can be queried like OLAP cubes
(with a specifically derived query language, RQL). In
contrast, our work acknowledges the multidimensional nature
of recommender systems, but focuses on their
multidimensional evaluation.
      </p>
      <p>
        Existing frameworks for recommender systems analysis
usually focus on the automatic selection of one
recommendation technique over another. E.g., [
        <xref ref-type="bibr" rid="ref10">10</xref>
] is focused on an API
that allows retrieval and derivation of user satisfaction with
respect to the recommenders employed. The AWESOME
system by Thor and Rahm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], the closest approach to that
presented here, shares the data warehousing approach, the
description of the necessary data preparation (ETL), and
the insight of breaking down the measures used for
recommender performance analysis by appropriate categories. But
contrary to the approach presented here, the AWESOME
framework is solely focused on website performance and
relies on static SQL-generated reports and decision criteria.
Furthermore, it incorporates no multidimensional approach
and does not aim at simplifying end-user-centric analysis or
interactive analysis at all.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. FRAMEWORK REQUIREMENTS</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 The Role of a Multidimensional Model</title>
      <p>
        Business analysts expect all data of a recommender
system (information about items, generated recommendations,
user preferences, etc.) to be organized around business
entities in the form of dimensions and measures based on a
multidimensional model. A multidimensional model enforces
structure upon data and expresses relationships between data
elements [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Such a model, thus, allows business analysts
to investigate all aspects of their recommender system by
using the popular OLAP technology [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This technology
provides powerful analytical capabilities that business
analysts can query to detect trends, patterns and anomalies
within the modeled measures of recommender systems'
performance across all involved dimensions.
      </p>
      <p>Multidimensional modeling provides comprehensibility for
the business analysts by organizing entities and attributes
of their recommender systems in a parent-child relationship
(1:N in databases terminology), into dimensions that are
identified by a set of attributes. For instance, the dimension
of recommended items may have as attributes the name of
the product, its type, its brand and category, etc. For the
business analyst, the attributes of a dimension represent a
specific business view on the facts (or key performance
indicators), which are derived from the intersection entities. The
attributes of a dimension can be organized in a hierarchical
way. For the example of a dimension about the user of the
recommender systems, such a hierarchy can result from the
geographic location of the user (e.g., address, city, or
country). In a multidimensional model, the measures (sometimes
called facts) are placed in the center with the dimensions
surrounding them, which forms the so-called star schema that
can be easily recognized by the business analysts. The star
schema of the proposed framework will be analyzed in the
following section.</p>
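      <p>As an illustration of this organization, the following sketch stores ratings as a fact table keyed to an item dimension and breaks the average-rating measure down along a dimension attribute. The table contents and column names are hypothetical; the prototype itself is a relational warehouse, not Python code.</p>
      <preformat>
```python
# Minimal star-schema sketch: a fact table of ratings keyed to an
# item dimension, with one measure (average rating) broken down
# along a dimension attribute (the item's category).
from collections import defaultdict

# Dimension table: item_id mapped to its attributes (made-up data)
dim_item = {
    1: {"name": "Toy Story", "category": "Animation"},
    2: {"name": "Heat", "category": "Action"},
    3: {"name": "Antz", "category": "Animation"},
}

# Fact table: one row per rating (user_id, item_id, rating)
fact_ratings = [(10, 1, 5.0), (10, 2, 3.0), (11, 1, 4.0), (11, 3, 4.0)]

def avg_rating_by(attr):
    """Aggregate the rating measure along one dimension attribute."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _user, item, rating in fact_ratings:
        key = dim_item[item][attr]
        sums[key] += rating
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

print(avg_rating_by("category"))
```
      </preformat>
      <p>Slicing by a different attribute (e.g., "name") needs no new query logic, which is the comprehensibility benefit the star schema provides to the analyst.</p>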
      <p>It is important to notice that aggregated scores, such as
the RMSE, are naturally supported. Nevertheless, the power
of a multidimensional model resides in adding further
derived measures and the capability of breaking all measures
down along the dimensions defined in a very intuitive and
highly automated way.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Core Features</title>
      <p>Organizing recommender data in a principled way
provides automation and tool support. The presented
framework enables analysis of all common recommender datasets.
It supports both rating prediction and item
recommendation scenarios. Besides that, data from other application
sources can and should be integrated for enriched analysis
capabilities. Notable sources are ERP systems, eCommerce
systems and experimentation platform systems employing
recommender systems. Their integration leverages analysis
of the recommender data by the information available within
the application (e.g., recommender performance given the
respective website layouts) and also analysis of the
application data by recommender information (e.g., revenue by
recommender algorithm).</p>
      <p>Compared to RMSE, MAE, precision, recall, and F-measure,
more information can be obtained with this framework as,
first, additional measures, e.g., for coverage, novelty, and
diversity analysis, are easily integrated and thus available for all
datasets. Second, all measures are enhanced by the
respective ranks, (running) differences, (running) percentages,
totals, standard deviations, and more.</p>
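      <p>How such derived columns follow from a base measure can be sketched as follows; the per-month rating counts are invented numbers, used only to show the rank and running-percentage logic:</p>
      <preformat>
```python
# Sketch: deriving rank and running percentage from per-month rating
# counts, as an OLAP engine would for any measure (numbers are made up).
counts_by_month = {"2000-05": 400, "2000-06": 250, "2000-07": 150, "2000-08": 200}

total = sum(counts_by_month.values())
running = 0
rows = []
for month in sorted(counts_by_month):
    running += counts_by_month[month]
    rows.append((month, counts_by_month[month], 100.0 * running / total))

# Rank each month by its count, highest count first
ranked = sorted(counts_by_month, key=counts_by_month.get, reverse=True)
rank = {m: i + 1 for i, m in enumerate(ranked)}

for month, count, pct in rows:
    print(month, count, f"{pct:.1f}%", "rank", rank[month])
```
      </preformat>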
      <p>
        While a single numerical score assigned to each
recommender algorithm's predictions is crucial for determining
winners in challenges or when choosing which algorithm to
deploy [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], from a business-insight point of view a lot of
interesting information is forgone this way. Relationships
between aspects of the data and their influence on the measure
may be hidden. One such relationship may be a diminishing increase in
algorithmic performance with respect to an increasing
number of ratings available per item, another the development of
the average rating over the lifetime of an item in the
product catalog. A key capability of this framework is exposing
intuitive ways for analyzing the above measures by other
measures or related dimensions.
      </p>
      <p>From a usability point of view, this framework contributes
convenient visual analysis empowering drag-and-drop analysis
and interactive behavior. Furthermore, convenient visual
presentation of the obtained results is integrated from the
start, as any standard-conforming client can handle it.
Manual querying is still possible as is extending the capabilities of
the framework with custom measures, dimensions, or
functions and post-processing of received results in other
applications. Inspection of the original source data is possible via
custom actions which allow the retrieval of the source rows
that produced the respective result. Last but not least,
aggregations allow for very fast analysis of very large datasets,
compared to other tools.</p>
      <p>The following section elaborates on the architecture of the
multidimensional model that is used by the proposed
framework, by providing its dimensions and measures.</p>
    </sec>
    <sec id="sec-7">
      <title>4. THE ARCHITECTURE OF THE MULTIDIMENSIONAL FRAMEWORK</title>
      <p>Figure 1 gives an overview of the architecture of the
framework. The source data and the extract-transform-load (ETL)
process cleaning it and moving it into the data store are
located at the bottom of the framework. The middle tier
stores the collected information in a data warehouse manner
regarding facts (dashed boxes in the center) and dimensions
(surrounding the facts). The multidimensional cubes (for
rating recommendation and item prediction) sitting on top
of the data store provide access to an extended set of
measures (derived from the facts in the warehouse) that allow
automatic navigation along their dimensions and interaction
with other measures.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 The Data Flow</title>
      <p>The data gathered for analysis can be roughly divided into
two categories:
Core data: consisting of the algorithms' training data, such
as past ratings, purchase transaction information,
online click streams, audio listening data, etc., and the
persisted algorithms' predictions.</p>
      <p>Increase-insight data: can be used as a means to
leverage the analytic power of the framework. It consists
roughly of user master data, item master data, user
transactional statistics, and item transactional
statistics. This data basically captures the metadata and
usage statistics not directly employed by current
recommender algorithms (such as demographic data,
geographic data, customer performance data, etc.).</p>
      <p>In case of recommender algorithms employed in
production environments, relational databases housing the
transactional system (maybe driving an e-commerce system like
an ERP system or an online shop) will store rich business
master data such as item and user demographic information,
lifetime information and more, next to rating information,
purchase information, and algorithm predictions. In the case of
scientific applications, different text files containing, e.g.,
rating information, implicit feedback, the respective user
and item attributes for training, and the algorithms'
predictions are the traditional source of the data.</p>
      <p>From the respective source, the master data, the
transactional data, and the algorithm predictions are cleaned,
transformed, and subsequently imported into a data warehouse.
Referential integrity between the elements is maintained, so
that, e.g., ratings of items not existing in the system are
impossible. Incongruent data is spotted during insertion into the
recommender warehouse and presented to the data expert.</p>
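      <p>The integrity check during the load can be sketched as follows; the rows are illustrative, and the prototype performs this inside the relational warehouse rather than in application code:</p>
      <preformat>
```python
# Sketch of the integrity check during warehouse load: incoming rating
# rows that reference an unknown item are set aside for expert review
# instead of being inserted (table contents are illustrative).
known_items = {1, 2, 3}  # item keys already present in the warehouse

incoming = [(10, 1, 4.0), (11, 7, 5.0), (12, 3, 2.0)]  # (user, item, rating)

clean, incongruent = [], []
for row in incoming:
    if row[1] in known_items:
        clean.append(row)
    else:
        incongruent.append(row)  # presented to the data expert

print(len(clean), "rows loaded,", len(incongruent), "flagged")
```
      </preformat>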
      <p>Inside the framework, the data is logically split into two
categories: measures (facts) that form the numeric
information for analysis, and dimensions that form the axes of
analysis for the related measures. In the framework schema
(figure 1), the measures are stylized within the dashed boxes.
The dimensions surround them and are connected to both
the rating prediction and the item recommendation
measures.</p>
    </sec>
    <sec id="sec-9">
      <title>4.2 The Measures</title>
      <p>Both groups of measures analyzed by the framework,
the measures for item recommendation algorithms and the
measures for rating prediction algorithms, can be divided
into basic statistical and information retrieval measures.
Statistical measures: Among the basic statistical measures
are counts and distinct counts, ranks, (running)
differences and (running) percentages of various totals
for each dimension table, train ratings, test ratings,
and predicted ratings; furthermore, averages and their
standard deviations for the lifetime analysis, train
ratings, test ratings, and predicted ratings.</p>
      <p>Information retrieval measures: Among the information
retrieval measures are the popular MAE and (R)MSE
for rating prediction, plus user-wise and item-wise
aggregated precision, recall, and F-measure for item
prediction. Novelty, diversity, and coverage measures are
also included as they provide additional insight.
Furthermore, for comparative analysis, the differences in
the measures between any two chosen (groups of)
prediction methods are supported as additional measures.</p>
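      <p>A minimal sketch of these core measures on hand-made toy data (not the Movielens results) may clarify how they are computed:</p>
      <preformat>
```python
# Sketch of the core error and retrieval measures named above,
# on tiny hand-made data.
import math

# Rating prediction: (actual, predicted) pairs
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
mae = sum(abs(a - p) for a, p in pairs) / len(pairs)
rmse = math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Item prediction: user-wise F-measure, then averaged over users
def f_measure(recommended, relevant):
    hits = len(set(recommended).intersection(relevant))
    if hits == 0:
        return 0.0
    precision = hits / len(recommended)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

# user: (recommended items, actually relevant items)
users = {"u1": (["a", "b", "c"], ["a", "c", "d"]), "u2": (["x"], ["y"])}
userwise_f = sum(f_measure(r, t) for r, t in users.values()) / len(users)

print(round(mae, 3), round(rmse, 3), round(userwise_f, 3))
```
      </preformat>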
      <p>In case a recommender system, and thus this framework, is
accompanied by a commercial or scientific application, this
application usually will have measures of its own. These
measures can easily be integrated into the analysis. An
example may be an eCommerce application adding sales
measures such as gross revenue to the framework. These external
measures can interact with the measures and the dimensions
of the framework (e.g., the revenue could be split up by year and
recommendation method, showing the business impact of a
recommender).</p>
    </sec>
    <sec id="sec-10">
      <title>4.3 The Dimensions</title>
      <p>The dimensions are used for slicing and dicing the selected
measures and for drilling down from global aggregates to fine
granular values. For our framework, the dimensions depicted
in figure 1 are:
Date: The Date dimension is one of the core dimensions
for temporal analysis. It consists of standard
members such as Year, Quarter, Month, Week, Day and
the respective hierarchies made up from those
members. Furthermore, Year-to-date (YTD) and
Quarter/Month/Week/Day of Year logic provides options
such as searching for a Christmas or Academy Awards
related effect.</p>
      <p>Time: The Time dimension offers Hour of Day and Minute
of Day/Hour analysis. For international datasets this
dimension profits from data being normalized to the
time zone of the creator (meaning the user giving the
rating).</p>
      <p>Age: The Age dimension is used for item and user lifetime
analysis. Age refers to the relative age of the user
or item at the time the rating is given/received or an
item from a recommendation list is put into a shopping
basket, and allows for analysis of trends in relative time
(cf. section 6).</p>
      <p>User: User and the related dimensions such as UserProfile
and UserDemographics allow for analysis by user
master data and by using dynamically derived
information such as activity related attributes. This enables
grouping of the users and content generated by them
(purchase histories, ratings) by information such as #
of ratings or purchases, # of days of activity, gender,
geography...</p>
      <p>Item: Item and the related dimensions such as
ItemCategory and ItemComponent parallel the user-dimensions.
In a movie dataset, the item components could be, e.g.,
actors, directors, and other credits.</p>
      <p>Prediction Method: The Prediction Method dimension
allows the OLAP user to investigate the effects of the
various classes and types of recommender systems and
their respective parameters. Hierarchies, such as
Recommender Class, Recommender Type, and Recommender
Parameters, simplify the navigation of the data.
eCommerce: As recommender algorithms usually
accompany a commercial or scientific application (e.g.,
eCommerce) having dimensions of its own, these dimensions
can easily be integrated into and be used by our
framework.</p>
      <p>
        Experimentation: In case this framework is used in an
experiment-driven scenario [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], such as an online or
marketing setting, Experimentation-related dimensions
should be used. They parallel the PredictionMethod
dimension, but are more specific to their usage
scenario.
      </p>
    </sec>
    <sec id="sec-11">
      <title>5. PROTOTYPE DESCRIPTION</title>
      <p>
        This section describes the implementation of a research
prototype for the proposed framework. The prototype was
implemented using Microsoft SQL Server 2008 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and was
used later for our performance evaluation.
      </p>
      <p>
        In our evaluation, the prototype considers the Movielens
1m dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is a common benchmark for
recommender systems. It consists of 6.040 users, 3.883 items, and
1.000.209 ratings received over roughly three years. Each
user has at least 20 ratings and the metadata supplied for
the users is userId, gender, age bucket, occupation, and
zipcode. Metadata for the item is movieId, title and genre
information.
      </p>
      <p>
        Following a classical data warehouse approach [
        <xref ref-type="bibr" rid="ref12 ref15">15, 12</xref>
        ], the
database tables are divided into dimension and fact tables.
The dimension tables generally consist of two kinds of
information: static master data and dynamic metadata. The
static master data usually originates from an ERP system or
another authoritative source and contains e.g. naming
information. The dynamic metadata is derived information
interesting for evaluation purposes, such as numbers of ratings
given or time spent on the system. To allow for always
up-to-date and rich information at the same time, we
follow the approach of using base tables for dimension master
data and views for dynamic metadata derived through
various calculations. Further views then expose the combined
information as pseudo table. The tables used in the
warehouse of the prototype are Date, Time, Genre (instantiation
of Category), Item, ItemGenre (table needed for mapping
items and genres), Numbers (a helper table), Occupation,
PredictedRatings, PredictedItems, PredictionMethod,
TestRatings, TestItems, TrainRatings, TrainItems, and User.
The Item and User tables are in fact views over the
master data provided with the Movielens dataset and dynamic
information gathered from usage data. Further views are
SquareError, UserwiseFMeasure, AllRatings, and
AgeAnalysis.
      </p>
      <p>On top of the warehouse prototype, an OLAP cube for
rating prediction was created using Microsoft SQL Server
Analysis Services. Within this cube, the respective
measures were created: counts and sums, and further derived
measures such as distinct counts, averages, standard
deviations, ranks, (running) differences and (running)
percentages. The core measures RMSE and MAE are derived from
the error between predicted and actual ratings. The most
important OLAP task with respect to framework
development is to define the relationships between the measures and
dimensions, as several dimensions are linked multiple times
(e.g., the Age dimension is role-playing, as it is linked against
both item age and user age) or only indirect relationships
exist (such as between category and rating, where the relationship
is only established via the item). Designing the relationships
has to be exercised very carefully, as both the correctness of the
model and the ability to programmatically navigate
dimensions and measures (adding them on the report axes,
measure field, or as filters) depend on this step. Linking
members enables generic dimensions, such as Prediction Method
A and Prediction Method B, that can be linked to chosen
dimension members. This renders unnecessary the creation
of the n(n-1)/2 possible measures yielding differences
between any two prediction methods A and B (for, say, RMSE
or F-measure). Furthermore, this approach allows choosing
more than one dimension member, e.g., several runs of one
algorithm with different parameters, as one linked member
for aggregate analysis.</p>
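      <p>The linked-member idea can be sketched as a single parameterized difference measure; the method names and squared errors below are invented for illustration:</p>
      <preformat>
```python
# Sketch of the linked-member idea: instead of materializing all
# n(n-1)/2 pairwise difference measures, the difference of a measure
# for any two chosen prediction methods is computed on demand.
import math

# squared errors per prediction method (as a SquareError view might hold)
sq_errors = {
    "ItemKNN": [0.81, 0.49, 1.00, 0.64],
    "MatrixFactorization": [0.36, 0.25, 0.81, 0.49],
}

def rmse(method):
    errs = sq_errors[method]
    return math.sqrt(sum(errs) / len(errs))

def rmse_difference(method_a, method_b):
    """The 'Prediction Method A vs. B' difference, for any two members."""
    return rmse(method_a) - rmse(method_b)

print(round(rmse_difference("ItemKNN", "MatrixFactorization"), 4))
```
      </preformat>
      <p>Aggregating several parameterizations of one algorithm into one linked member amounts to pooling their error lists before the same computation.</p>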
      <p>Before we go on to the evaluation of our prototype, let
us state that our framework describes more than simply a
model for designing evaluation frameworks. The prototype
serves well as a template for other recommender datasets,
too. With nothing changed besides the data load procedure,
it can be used directly for, e.g., the other Movielens datasets,
the Netflix challenge dataset, or the Eachmovie dataset.
Additional data available in those datasets (e.g., the tagging
information from the Movielens 10m dataset) are either
ignored or require an extension of the data warehouse and the
multidimensional model (resulting in new analysis
possibilities).</p>
    </sec>
    <sec id="sec-12">
      <title>6. PERFORMANCE EVALUATION</title>
      <p>In the previous section we have described the
implementation of a research prototype of the proposed framework using
the Movielens 1m dataset. Building on this prototype, we
proceed with presenting a set of results that are obtained by
applying it.</p>
      <p>
        We have to clarify that the objective of our experimental
evaluation is not limited to the comparison of specific
recommender algorithms, as is mostly performed in works
that propose such algorithms. Our focus is, instead, on
demonstrating the flexibility and ease with which we can
answer important questions about the performance of
recommendations. It is generally agreed that it is important to explicitly model
the effects describing changes in the rating behavior over
the various users (user base-effect), items (item base-effect),
and age of the respective item or user (time effects) [
        <xref ref-type="bibr" rid="ref16 ref17 ref3">3, 16, 17</xref>
        ]. For this reason, we choose to demonstrate the
benefits of the proposed framework by setting our scope on
those effects, followed by exemplarily dissecting the
performance of two widely examined classes of recommender
algorithms, i.e., collaborative filtering and matrix factorization.
We also consider important the exploratory analysis of items
and users, which can provide valuable insights for business
analysts about factors determining the performance of their
recommender systems. We believe that the results presented
in the following demonstrate how easy it is to obtain them
by using the proposed framework, which favors its usage in
real-world applications, but can also provide valuable
conclusions motivating the usage of the framework for pure
research purposes, since it allows for observing and analyzing
the performance by combining all related dimensions that
are being modeled.
      </p>
      <p>All results presented in the remainder of this section could
easily be obtained graphically by navigating the presented
measures and dimensions using Excel 2007 as a
multidimensional client.</p>
    </sec>
    <sec id="sec-13">
      <title>6.1 Exploratory Data Analysis</title>
      <p>Using the framework, the first step for both a research and a
business analytics approach is exploring the data. As an
example, the Calendar dimension (Date) is used to slice
the average rating measure. Figure 2 presents this as a pivot
chart. The sharp slumps noticeable in March and August
2002, together with a general lack of smoothness beyond mid
2001, arouse curiosity and suggest replacing average rating
by rating count (figure not shown). Changing from counts
to running percentages shows that about 50 percent of the
ratings in this dataset are cast within the first six months
out of nearly three years. Within two more months, 90
percent of the ratings are assigned, leaving roughly seven
percent of the data for 50 percent of the time (figure 3).</p>
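<p>The two slice operations just described, average rating by calendar month and a running percentage of rating counts, can be sketched in plain Python; the toy (month, rating) pairs below are hypothetical stand-ins for the warehouse fact table.</p>

```python
from collections import Counter, defaultdict

# Hypothetical toy ratings: (month, rating) pairs standing in for the
# Movielens rating facts; real values would come from the warehouse.
ratings = [("2000-05", 4), ("2000-05", 5), ("2000-06", 3),
           ("2000-06", 4), ("2001-01", 2), ("2002-03", 1)]

# Slice the average-rating measure by the Calendar (Date) dimension.
by_month = defaultdict(list)
for month, r in ratings:
    by_month[month].append(r)
avg_rating = {m: sum(rs) / len(rs) for m, rs in sorted(by_month.items())}

# Switch the measure from average rating to a running percentage of
# rating counts, the view used to spot the early-ratings concentration.
counts = Counter(m for m, _ in ratings)
total, cum = len(ratings), 0.0
running_pct = {}
for month in sorted(counts):
    cum += counts[month]
    running_pct[month] = 100.0 * cum / total
```

<p>In the OLAP client this is a single measure swap on the same pivot; the sketch only makes explicit what the cube computes.</p>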
      <sec id="sec-13-1">
        <title>6.1.1 Item Analysis</title>
        <p>
          The framework allows an easy visualization of the item
e ect described e.g. in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], namely that there usually is a
systematic variation of the average rating per item.
Aditionally, other factors can easily be integrated in such an
analysis. Figure 4 shows the number of ratings received
per item sorted by decreasing average rating. This
underlines the need for regularization when using averages, as the
movies rated highest only received a vanishing number of
ratings.
        </p>
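<p>The need for regularization noted above can be made concrete with a small sketch: a damped (Bayesian-shrunk) mean demotes items whose high raw average rests on a vanishing number of ratings. The item names, rating values, damping constant and global mean below are all illustrative assumptions, not values from the dataset.</p>

```python
# Toy item ratings illustrating the item effect: the top raw-average
# item has only a single rating.
item_ratings = {
    "obscure_gem": [5],                    # highest raw average, 1 rating
    "blockbuster": [4, 5, 4, 4, 5, 4, 4],  # many, generally positive
    "average_movie": [3, 3, 4, 2],
}

GLOBAL_MEAN = 3.6  # assumed global average rating
K = 5              # damping strength (pseudo-counts toward the mean)

def damped_mean(rs, k=K, mu=GLOBAL_MEAN):
    """Regularized average: shrinks low-count items toward mu."""
    return (sum(rs) + k * mu) / (len(rs) + k)

raw = {i: sum(rs) / len(rs) for i, rs in item_ratings.items()}
reg = {i: damped_mean(rs) for i, rs in item_ratings.items()}

# The raw average ranks the single-rating item first ...
assert max(raw, key=raw.get) == "obscure_gem"
# ... while the damped average prefers the heavily rated item.
assert max(reg, key=reg.get) == "blockbuster"
```
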
        <p>
          Moving on the x-axis from single items to rating count
buckets containing a roughly equal number of items, a trend
of heavier rated items being rated higher can be observed
( gure omitted for space reasons). A possible explanation
might be that blockbuster movies accumulate a huge
number of generally positive ratings during a short time and the
all-time classics earn a slow but steady share of additional
coverage. That all-time classics receive higher ratings can
nicely be proved with the framework, too. Consistent with
ndings during the nal phase of the Net ix competition by
Koren [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], gure 5 shows a justi cation for the good
results obtained by adding time-variant base e ects to
recommender algorithms. Besides the all-time classics e ect, the
blockbuster e ect can also be observed ( gure 6), showing
that items who receive numerous ratings per day on average
also have a higher rating.
        </p>
      </sec>
      <sec id="sec-13-2">
        <title>6.1.2 User Analysis</title>
        <p>The user effect can be analyzed just as easily as the item
effect. Reproducing the analysis explained above on the users
(figure 7, x-axis: number of ratings per user), it is
interesting to notice that for heavy raters the user rating count
effect is the inverse of the item rating count effect described
above: the higher the number of ratings given by a user,
the lower his or her average rating. One explanation for this
behavior might be that real heavy raters encounter a lot of
rather trashy, or at least low-quality, movies.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>6.2 Recommender Model Diagnostics</title>
      <p>For algorithm performance comparison, the Movielens 1m
ratings were randomly split into two nearly equal-size
partitions, one for training (500,103 ratings) and one for testing
(500,104 ratings). Algorithm parameter estimation was
conducted on the training samples only; predictions were made
solely on the test partition. As an example, a vanilla matrix
factorization (20 features, regularization 0.09, learn rate 0.01,
56 iterations, hyperparameters optimized by 5-fold
cross-validation) is analyzed.2</p>
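<p>As a rough sketch of this experimental setup, not the prototype's actual code, the following Python scales the procedure down to toy data: a random 50/50 split, then a vanilla matrix factorization trained by SGD with the stated regularization and learn rate (the feature count is reduced from 20 to 4 and the rating triples are synthetic so the example runs instantly).</p>

```python
import random

random.seed(0)

# Toy (user, item, rating) triples standing in for the Movielens 1m data.
data = [(u, i, float(3 + (u + i) % 3)) for u in range(8) for i in range(10)]

# Random split into two nearly equal-size partitions.
random.shuffle(data)
train, test = data[:len(data) // 2], data[len(data) // 2:]

# Vanilla matrix factorization trained by SGD; 4 features instead of 20.
F, REG, LR, ITERS = 4, 0.09, 0.01, 56
P = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(8)]   # user factors
Q = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(10)]  # item factors
mu = sum(r for _, _, r in train) / len(train)                      # global mean

def predict(u, i):
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def rmse_on(split):
    return (sum((r - predict(u, i)) ** 2 for u, i, r in split) / len(split)) ** 0.5

before = rmse_on(train)
for _ in range(ITERS):
    for u, i, r in train:
        e = r - predict(u, i)
        for f in range(F):
            pf, qf = P[u][f], Q[i][f]
            P[u][f] += LR * (e * qf - REG * pf)   # gradient step, user factor
            Q[i][f] += LR * (e * pf - REG * qf)   # gradient step, item factor
after = rmse_on(train)
test_rmse = rmse_on(test)  # parameters were fit on train only
```

<p>Predictions for the test partition would then be loaded into the warehouse as one more member of the prediction-method dimension.</p>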
      <p>For a researcher, the general aim will be to improve the
overall RMSE or F-measure, depending on the task, as this
is usually what wins a challenge or raises the bar on a given
dataset. For a business analyst this is not necessarily the
case. A business user might be interested in breaking down
the algorithm's RMSE over categories, top items, or top
users, as this may be relevant information from a monetary
perspective. The results of the respective queries may well lead
to one algorithm being replaced by another on a certain part
of the dataset (e.g. a subset of the product hierarchy).</p>
      <p>In figure 8, RMSE is plotted against item rating count in
train. This indicates that more ratings on an item do help
factor models. Interpreted the other way around, for a
business user this implies that this matrix factorization yields
its best performance on the items most crucial to him from a
top-sales point of view (though for slow sellers other
algorithms might be more helpful).</p>
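<p>A drill-down of this kind, RMSE broken down by item train rating count, amounts to bucketing prediction errors by how often each item was rated in train. The records and bucket boundaries below are illustrative assumptions, not the paper's measurements.</p>

```python
from collections import defaultdict
from math import sqrt

# Hypothetical test-set records: (item_train_rating_count, error) pairs,
# i.e. each prediction error together with how often that item was
# rated in the training partition.
records = [(1, 1.2), (2, 1.1), (3, 0.9), (8, 0.8), (12, 0.7),
           (40, 0.6), (55, 0.5), (300, 0.4), (350, 0.35)]

def bucket(count):
    """Coarse item-train-rating-count buckets for the drill-down."""
    if count < 10:
        return "1-9"
    if count < 100:
        return "10-99"
    return "100+"

errs = defaultdict(list)
for count, e in records:
    errs[bucket(count)].append(e)

rmse_by_bucket = {b: sqrt(sum(e * e for e in es) / len(es))
                  for b, es in errs.items()}
# More training ratings per item -> lower RMSE for the factor model.
assert rmse_by_bucket["1-9"] > rmse_by_bucket["10-99"] > rmse_by_bucket["100+"]
```

<p>In the cube this is one slice of the RMSE measure by an item-rating-count attribute; no per-bucket measure needs to be predefined.</p>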
      <p>The same trend can be spotted when RMSE is analyzed by
user rating count on the training set (figure omitted for space
reasons), though the shape of the curve follows a straighter
line than for the item train rating count (where it follows
more of an exponential decay).</p>
      <p>Due to the approach taken in the design of the OLAP
cube, the number of recommender algorithms comparable as
A and B is not limited; neither does it have to be exactly
one algorithm being compared with exactly one other, as
multiple selection is possible. Furthermore, given that the
predictions are already in the warehouse, replacing one method
by another or grouping several methods as A or B can nicely
be achieved by selecting them in the appropriate drop-down
list. As an example, the matrix factorization analyzed above
is compared to the global average of ratings as baseline
recommendation method. Figure 9 (x-axis: item train rating
count) reveals that for this factor model more ratings in train
do increase the relative performance, as expected, up to a
point from which the static baseline method gains back
roughly half the lost ground. Investigating this issue might
be interesting for future recommender models.</p>
      <p>2 The matrix factorization yielded an RMSE of 0.8831 given
the presented train-test split.</p>
      <p>All results presented could be obtained very quickly: the
time needed to design the query and the report (chart) was
on average seconds for constructing the query and making
the chart look nice, and the execution time was in the
sub-second timeframe.</p>
    </sec>
    <sec id="sec-15">
      <title>7. CONCLUSIONS</title>
      <p>We have proposed a novel multidimensional framework
for integrating OLAP with the challenging task of
evaluating recommender systems. We have presented the
architecture of the framework as a template and described the
implementation of a research prototype. Consistent with
the other papers at this workshop, the authors of this work
believe that the perceived value of a system largely depends
on its user interface. Thus, this work provides an
easy-to-use framework supporting visual analysis. Our evaluation
also demonstrates some of the elegance of obtaining
observations with the proposed framework. Besides confirming,
on another dataset, the validity of findings from the recent
Netflix prize, we could provide new insights, too. With
respect to recommender performance evaluation and the
validity of RMSE as an evaluation metric, it would be
interesting to see if a significant difference in RMSE related to
the amount of ratings present in the training set would also
lead to significant effects in a related user study.</p>
      <p>In our future work, we will consider extending our
research prototype and developing a web-based implementation
that will promote its usage.</p>
    </sec>
    <sec id="sec-16">
      <title>ACKNOWLEDGMENTS</title>
      <p>The authors gratefully acknowledge the co-funding of their
work through the European Commission FP7 project
MyMedia (grant agreement no. 215006) and through the
European Regional Development Fund project LEFOS (grant
agreement no. 80028934).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sankaranarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          .
          <article-title>Incorporating contextual information in recommender systems using a multidimensional approach</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>103</fpage>
          -
          <lpage>145</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          .
          <article-title>Multidimensional recommender systems: A data warehousing approach</article-title>
          .
          <source>In WELCOM '01: Proceedings of the Second International Workshop on Electronic Commerce</source>
          , pages
          <fpage>180</fpage>
          -
          <lpage>192</lpage>
          , London, UK,
          <year>2001</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Modeling relationships at multiple scales to improve accuracy of large recommender systems</article-title>
          .
          <source>In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>95</fpage>
          -
          <lpage>104</lpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bollen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Knijnenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Willemsen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Graus</surname>
          </string-name>
          .
          <article-title>Understanding choice overload in recommender systems</article-title>
          .
          <source>In RecSys '10: Proceedings of the 2010 ACM conference on Recommender systems</source>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Breese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Heckerman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Kadie</surname>
          </string-name>
          .
          <article-title>Empirical analysis of predictive algorithms for collaborative filtering</article-title>
          .
          <source>In MSR-TR-98-12</source>
          , pages
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          . Morgan Kaufmann,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Celma</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Herrera</surname>
          </string-name>
          .
          <article-title>A new approach to evaluating novel recommendations</article-title>
          .
          <source>In RecSys '08: Proceedings of the 2008 ACM conference on Recommender systems</source>
          , pages
          <fpage>179</fpage>
          -
          <lpage>186</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Codd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Codd</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Salley</surname>
          </string-name>
          .
          <article-title>Providing OLAP to user-analysts: An IT mandate</article-title>
          . Ann Arbor,MI,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Crook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Frasca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kohavi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Longbotham</surname>
          </string-name>
          .
          <article-title>Seven pitfalls to avoid when running controlled experiments on the web</article-title>
          .
          <source>In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>1105</fpage>
          -
          <lpage>1114</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] GroupLens.
          <article-title>Movielens data sets</article-title>
          . http://www.grouplens.org/node/73.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Avesani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>An on-line evaluation framework for recommender systems</article-title>
          . In Workshop on Personalization and Recommendation in E-Commerce, Malaga. Springer Verlag,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Terveen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>Evaluating collaborative filtering recommender systems</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Inmon</surname>
          </string-name>
          .
          <article-title>Building the Data Warehouse</article-title>
          . Wiley, 4th ed.,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Granka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hembrooke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gay</surname>
          </string-name>
          .
          <article-title>Accurately interpreting clickthrough data as implicit feedback</article-title>
          .
          <source>In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>154</fpage>
          -
          <lpage>161</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>How users perceive and appraise personalized recommendations</article-title>
          .
          <source>In UMAP '09: Proceedings of the 17th International Conference on User Modeling, Adaptation, and Personalization</source>
          , pages
          <fpage>461</fpage>
          -
          <lpage>466</lpage>
          , Berlin, Heidelberg,
          <year>2009</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kimball</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ross</surname>
          </string-name>
          .
          <article-title>The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling</article-title>
          . Wiley, 2nd ed.,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          .
          <article-title>Factorization meets the neighborhood: a multifaceted collaborative filtering model</article-title>
          .
          <source>In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>426</fpage>
          -
          <lpage>434</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          .
          <article-title>Collaborative filtering with temporal dynamics</article-title>
          .
          <source>In KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>447</fpage>
          -
          <lpage>456</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          .
          <source>Microsoft SQL Server</source>
          <year>2008</year>
          homepage. http://www.microsoft.com/sqlserver/2008/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Marakas</surname>
          </string-name>
          .
          <source>Management Information Systems</source>
          . McGraw-Hill/Irwin
          , 9th ed.,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Thor</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Awesome: a data warehouse-based system for adaptive website recommendations</article-title>
          .
          <source>In VLDB '04: Proceedings of the Thirtieth international conference on Very large data bases</source>
          , pages
          <fpage>384</fpage>
          -
          <lpage>395</lpage>
          . VLDB Endowment,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Hurley</surname>
          </string-name>
          .
          <article-title>Avoiding monotony: improving the diversity of recommendation lists</article-title>
          .
          <source>In RecSys '08: Proceedings of the 2008 ACM conference on Recommender systems</source>
          , pages
          <fpage>123</fpage>
          -
          <lpage>130</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Lausen</surname>
          </string-name>
          .
          <article-title>Improving recommendation lists through topic diversification</article-title>
          .
          <source>In WWW '05: Proceedings of the 14th international conference on World Wide Web</source>
          , pages
          <fpage>22</fpage>
          -
          <lpage>32</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>