<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Idomaar: A Framework for Multi-dimensional Benchmarking of Recommender Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Scriminaci</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Lommatzsch</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Kille</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hopfgartner</string-name>
          <email>frank.hopfgartner@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
          <email>m.a.larson@tudelft.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Malagoli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andras Sereny</string-name>
          <email>sereny.andras@gravityrd.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Till Plumbaum</string-name>
        </contrib>
        <aff id="aff3">
          <label>3</label>
          <institution>ContentWise R&amp;D - Moviri</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
          <email>{firstname.lastname}@moviri.com</email>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>TU Berlin - DAI-Lab</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
          <email>{firstname.lastname}@dai-labor.de</email>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Gravity R&amp;D</institution>
          ,
          <addr-line>Budapest</addr-line>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Delft</institution>
          ,
          <addr-line>Delft</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Glasgow</institution>
          ,
          <addr-line>Glasgow</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
        <p>In real-world scenarios, recommenders face non-functional requirements of a technical nature and must handle dynamic data in the form of sequential streams. Evaluation of recommender systems must take these issues into account in order to be maximally informative. In this paper, we present Idomaar, a framework that enables the efficient multi-dimensional benchmarking of recommender algorithms. Idomaar goes beyond current academic research practices by creating a realistic evaluation environment and computing both effectiveness and technical metrics for stream-based as well as set-based evaluation. A scenario focusing on the “research to prototyping to productization” cycle at a company illustrates Idomaar's potential. We show that Idomaar simplifies testing with varying configurations and supports flexible integration of different data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION AND MOTIVATION</title>
      <p>
        Increasingly, we witness a shift of recommender system research
toward large-scale systems developed for industry settings. This
trend was already well described in Amatriain’s tutorial on
building large-scale real-world recommender systems at ACM
RecSys 2012 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Given commercial systems’ complexity and the
demand for high performance, evaluation is subject to additional
requirements: contribution of complementary information, reliability
in handling large-scale problems, and use of different methods and
metrics. Evaluation must allow both offline parameter tuning and
online monitoring of systems.
      </p>
      <p>
        Benchmarking the performance of recommender systems along these
dimensions is challenging. Mark Levy pointed this out during his
keynote at the ACM RecSys 2013 workshop on Reproducibility and
Replication in Recommender Systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Said and Bellogín [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
concur with his point in their analysis of existing frameworks’
capabilities. Many commonly used software suites do not provide the
functionality required to benchmark different aspects, or they are
too complex to set up.
      </p>
      <p>
        We introduce Idomaar to address this challenge (Idomaar is
available at https://github.com/crowdrec/idomaar; see also
http://rf.crowdrec.eu). It enables
researchers to evaluate different algorithms with respect to multiple
criteria. The framework uses large-scale static data sets to simulate
live data streams, bringing offline evaluation closer to online A/B
testing. By comparing the performance of recommender algorithms
operating in a live system (e.g., as studied in the living lab
NewsREEL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) with their performance on these simulated data streams, the
framework can be used to study the transferability of offline evaluation
results to an online setting. Finally, Idomaar enables multi-dimensional
evaluation, which simultaneously measures the performance of algorithms
with respect to precision-related and technical aspects. These cover
CTR and scalability-related measures (throughput and response time).
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH AND FRAMEWORK</title>
      <p>The reference framework Idomaar is a tool to evaluate
recommendation services in real-world settings. As opposed to typical
recommender system evaluation, which assumes static information,
real-world applications process data in the form of a stream of
information. In fact, users, items, and the interactions between them
continuously generate events that are fed to the recommender system. For
instance, new users register or existing users cancel their subscriptions;
new items emerge; users consume items. Such information must be
ingested and processed as soon as possible (e.g., by updating the
recommendation models) in order to be available. All these messages
are handled asynchronously. However, the system also has to
serve incoming recommendation requests synchronously within strict
time constraints. In practice, the whole flow of incoming messages
is managed by means of queues.</p>
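      <p>As an illustration of this interplay, the following toy sketch (our
own, not part of Idomaar's code base) uses an in-memory queue in place of
the real message queue: events are ingested asynchronously into a simple
popularity model, while recommendation requests are answered synchronously
from the current model state.</p>

```python
import queue

# Toy sketch only: the queue stands in for the real message queue
# (e.g., Apache Kafka) and the "model" is a bare popularity counter.
class StreamingRecommender:
    def __init__(self):
        self.events = queue.Queue()  # asynchronous message queue
        self.popularity = {}         # item -> interaction count

    def push_event(self, event):
        # producers enqueue events without waiting for the model update
        self.events.put(event)

    def drain(self):
        # ingest pending events as soon as possible, updating the model
        while not self.events.empty():
            e = self.events.get()
            if e["type"] == "interaction":
                item = e["item"]
                self.popularity[item] = self.popularity.get(item, 0) + 1

    def recommend(self, k=3):
        # synchronous request: answered from the current model state
        ranked = sorted(self.popularity, key=self.popularity.get, reverse=True)
        return ranked[:k]

rec = StreamingRecommender()
for item in ["a", "b", "a", "c", "a", "b"]:
    rec.push_event({"type": "interaction", "item": item})
rec.drain()
print(rec.recommend(2))  # -> ['a', 'b']
```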
      <p>Idomaar mimics the workflow of such a real-world scenario by
using state-of-the-art technologies (e.g., Apache Flume and Apache
Kafka) to manage data streaming. The architecture is split
into four main modules, as depicted in Fig. 1: the data container, the
evaluator, the orchestrator, and the computing environment.</p>
      <p>
        The data container stores the data (entities and relations, in
accordance with the format defined in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Part of the data bootstraps
the recommender system for training algorithms, while most of the
remaining data feeds the recommender system in real time while it
has to serve incoming recommendation requests for test purposes.
Finally, the remaining subset of the data, the ground truth, is hidden
from the recommender system and used to evaluate the quality of
the service in terms of user metrics. The data is read by a custom
Apache Flume source and sent into an Apache Kafka queue.
      </p>
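      <p>The chronological three-way split described above can be sketched as
follows; the proportions and field names are illustrative assumptions, not
Idomaar's actual defaults.</p>

```python
# Illustrative split: the oldest events bootstrap (train) the
# recommender, the middle portion is replayed as the test stream, and
# the newest slice is withheld as ground truth. Fractions are assumed.
def split_events(events, bootstrap_frac=0.6, stream_frac=0.3):
    ordered = sorted(events, key=lambda e: e["ts"])  # chronological order
    n = len(ordered)
    b = int(n * bootstrap_frac)
    s = b + int(n * stream_frac)
    return ordered[:b], ordered[b:s], ordered[s:]

events = [{"ts": t, "item": i} for t, i in enumerate("abcdefghij")]
boot, stream, truth = split_events(events)
print(len(boot), len(stream), len(truth))  # -> 6 3 1
```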
      <p>The recommender system runs within a virtual machine,
referred to as the computing environment, which is
created with Vagrant (https://www.vagrantup.com/) and in which
all required libraries are automatically provisioned with Puppet
(https://puppetlabs.com/). The recommender system
subscribes to the required Apache Kafka channel and receives the
asynchronous messages (i.e., users, items, interactions, etc.).
Recommendation requests are synchronously sent via an HTTP interface
(or, alternatively, a 0MQ interface).</p>
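      <p>The synchronous side of the computing environment can be pictured as
a minimal stand-alone HTTP endpoint. The path and JSON payload below are
assumptions for illustration, not Idomaar's actual recommendation
protocol.</p>

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal stand-in for a computing environment's HTTP interface; a
# real recommender would build the answer from its model state.
class RecommendHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"recommended_items": ["item1", "item2"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = HTTPServer(("127.0.0.1", 0), RecommendHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# the orchestrator side: a synchronous recommendation request
url = "http://127.0.0.1:%d/recommend" % server.server_port
with urllib.request.urlopen(url) as resp:
    answer = json.loads(resp.read())
server.shutdown()
print(answer["recommended_items"])  # -> ['item1', 'item2']
```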
      <p>The evaluator compares recommendations generated by the
computing environment with the ground truth. In addition to standard
user metrics (RMSE, recall, precision), Idomaar evaluates business
metrics (e.g., scalability, response time, throughput), so as to provide a
360-degree evaluation of the recommendation infrastructure.</p>
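      <p>The two classes of metrics can be sketched with a few toy functions;
these are the standard textbook definitions, not Idomaar's
implementation.</p>

```python
import math

def precision_recall(recommended, relevant):
    # fraction of recommended items that are relevant, and vice versa
    hits = len(set(recommended).intersection(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def rmse(predicted, actual):
    # root-mean-square error between predicted and observed ratings
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def throughput(response_times_s):
    # a technical metric: requests served per second of serving time
    return len(response_times_s) / sum(response_times_s)

p, r = precision_recall(["a", "b", "c"], ["b", "c", "d", "e"])
print(p, r)  # precision 2/3, recall 1/2
print(rmse([3.0, 4.0], [3.0, 5.0]))  # sqrt(0.5)
print(throughput([0.1, 0.1, 0.2]))  # -> 7.5
```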
      <p>Finally, the orchestrator coordinates all processes, including
launching and provisioning the computing environment,
instructing the evaluator to split the data into training, test, and ground truth,
feeding the recommender system with the incoming messages in
accordance with their timestamps, collecting the generated
recommendations, and computing the quality metrics.</p>
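      <p>The replay of messages in timestamp order can be sketched as a lazy
merge of chronologically sorted event sources; the event layout below is
made up for illustration.</p>

```python
import heapq

# Feed events strictly in timestamp order, as they would arrive live.
# Each source is an iterable of (timestamp, payload) pairs, already
# sorted; heapq.merge interleaves them lazily by timestamp.
def replay(*event_streams):
    for ts, payload in heapq.merge(*event_streams):
        yield ts, payload

users = [(1, "user:alice registers"), (5, "user:bob registers")]
plays = [(2, "alice plays item 9"), (3, "alice plays item 4")]
timeline = [payload for _, payload in replay(users, plays)]
print(timeline[0])  # -> 'user:alice registers'
```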
      <p>Moving from an offline toward an online scenario (where the data
stream is not simulated from historical information but is the real
flow of data) means either replacing the Apache Flume source with
another one (e.g., one that reads from log data) or ingesting the data
directly into the Apache Kafka queue.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        Various frameworks have been proposed to facilitate evaluating
recommender systems. Ekstrand et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced LensKit to
increase the comparability of recommender system evaluation. Mahout
is a scalable machine learning toolkit implemented in Java. Both
frameworks ship with a selection of recommendation algorithms
and some evaluators. Gantner et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] created MyMediaLite as a
lightweight recommender system framework. It comprises some
recommendation algorithms along with predefined evaluation
protocols. Said and Bellogín [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed RiVal to facilitate comparing
various recommendation algorithms. The framework’s architecture
supports cross-framework comparisons. The variety of frameworks
emphasizes the demand for tools to evaluate recommender
systems. Although the presented tools support evaluation, all of these
frameworks measure quality only in terms of predictive performance.
Operating recommender systems face additional challenges. For
instance, they might be subject to response time restrictions or
experience heavy load. Finally, running the above-mentioned frameworks
on different hardware still yields inconsistent results. For these
reasons, we propose Idomaar, a language-agnostic framework with
cloud support and the ability to measure time and space complexity.
      </p>
    </sec>
    <sec id="sec-4">
      <title>PROTOTYPE TO PRODUCTIZATION</title>
      <p>
        Idomaar was used in the “research to prototyping to operating”
cycle of a recommender system service provider to validate its
usefulness. The focus of the validation was on multi-dimensional
evaluation that simultaneously takes effectiveness and technical
constraints into account. The only limitation identified concerned the
monitoring of performance metrics such as CPU and memory usage
(cf. [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ]).
      </p>
      <p>With respect to the cycle itself, we found that having a standard
in terms of data formats and APIs increases the reusability of code
in all phases and helps data scientists produce code that can be
transformed into effective prototypes. Our “research to prototyping
to operating” cycle shows the following:</p>
      <list list-type="bullet">
        <list-item>
          <p>Idomaar allowed easy testing using different datasets with
different algorithms that share the same input types and subjects
(e.g., implicit or explicit events, sessions or users).</p>
        </list-item>
        <list-item>
          <p>The Idomaar format is flexible enough to change subjects or
event types, or to integrate contextual information, both on
events and on recommendation requests.</p>
        </list-item>
        <list-item>
          <p>Idomaar can be considered a suitable tool for recommender
system research: reuse of code speeds up prototyping, and
standardization of datasets helps merge different data sources.</p>
        </list-item>
      </list>
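      <p>As a concrete illustration of this flexibility, a single
tab-separated data line in the spirit of the data model from [9] can be
parsed as below; the exact field order and names here are our assumptions
for illustration, not a normative description of the format.</p>

```python
import json

# Hypothetical parser for one tab-separated data line; the field
# order (type, id, timestamp, properties, linked entities) is an
# assumption for illustration.
def parse_line(line):
    obj_type, obj_id, ts, properties, linked = line.rstrip("\n").split("\t")
    return {
        "type": obj_type,
        "id": int(obj_id),
        "timestamp": int(ts),
        "properties": json.loads(properties) if properties else {},
        "linked_entities": json.loads(linked) if linked else {},
    }

line = "rating\t42\t1262304000\t" + json.dumps({"rating": 4}) \
       + "\t" + json.dumps({"subject": "user:7", "object": "movie:99"})
event = parse_line(line)
print(event["properties"]["rating"])  # -> 4
```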
      <p>In the future, Idomaar will go beyond classical recommender
systems domains (e.g., movies or products) and consider additional
types such as actions or navigation trees. Supporting generic objects
and additional evaluation functions promises to establish Idomaar as
a standard research tool for recommender systems. Such a standard
could provide valuable support for the current trend of researchers
participating in community-wide recommender system challenges.
Idomaar has already been applied in such a challenge.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>In this paper, we present the Idomaar framework, which enables
the efficient, reproducible evaluation of recommender algorithms
in real-world stream-based scenarios. Idomaar simplifies
multi-dimensional evaluation, taking into account precision-related metrics
as well as technical aspects.</p>
      <p>Acknowledgment: The research leading to these results was performed
in the CrowdRec project, which has received funding from the EU 7th
Framework Programme FP7/2007-2013 under grant agreement No. 610594.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          .
          <article-title>Building industrial-scale real-world recommender systems</article-title>
          .
          <source>In RecSys '12</source>
          , pages
          <fpage>7</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          .
          <article-title>Shedding light on a living lab: the CLEF NEWSREEL open recommendation platform</article-title>
          .
          <source>In IIiX '14</source>
          , pages
          <fpage>223</fpage>
          -
          <lpage>226</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Ekstrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ludwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>Rethinking the recommender research ecosystem: Reproducibility, openness, and LensKit</article-title>
          .
          <source>In RecSys'11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <article-title>MyMediaLite: A free recommender system library</article-title>
          .
          <source>In RecSys'11</source>
          , pages
          <fpage>305</fpage>
          -
          <lpage>308</lpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Knauerhase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          .
          <article-title>An analysis of performance interference effects in virtual environments</article-title>
          .
          <source>In ISPASS'07. IEEE</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Levy</surname>
          </string-name>
          .
          <article-title>Offline evaluation of recommender systems: all pain and no gain?</article-title>
          <source>In RecSys 2013</source>
          , page
          <fpage>1</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bellogín</surname>
          </string-name>
          .
          <article-title>Comparative recommender system evaluation: Benchmarking recommendation frameworks</article-title>
          .
          <source>In RecSys '14</source>
          , pages
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bellogín</surname>
          </string-name>
          .
          <article-title>Rival: A toolkit to foster reproducibility in recommender system evaluation</article-title>
          .
          <source>In RecSys'14</source>
          , pages
          <fpage>371</fpage>
          -
          <lpage>372</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Loni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          .
          <article-title>An extended data model format for composite recommendation</article-title>
          .
          <source>In RecSys'14 (Posters)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Tickoo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Illikkal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Newell</surname>
          </string-name>
          .
          <article-title>Modeling virtual machine performance: challenges and approaches</article-title>
          .
          <source>SIGMETRICS Perf. Evaluation Review</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ):
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>