<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of CLEF NewsREEL 2015: News Recommendation Evaluation Lab</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Kille</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Lommatzsch</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Turrin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andras Sereny</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martha Larson</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Torben Brodt</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Seiler</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hopfgartner</string-name>
        </contrib>
        <aff>TU Berlin, Berlin, Germany ({benjamin.kille, andreas.lommatzsch}@dai-labor.de)</aff>
        <aff>ContentWise R&amp;D - Moviri, Milan, Italy (roberto.turrin@moviri.com)</aff>
        <aff>Gravity R&amp;D, Budapest, Hungary (sereny.andras@gravityrd.com)</aff>
        <aff>TU Delft, Delft, The Netherlands (m.a.larson@tudelft.nl)</aff>
        <aff>Plista GmbH, Berlin, Germany</aff>
      </contrib-group>
      <abstract>
        <p>News readers struggle as they face ever-increasing numbers of articles. Digital news portals are becoming more and more popular. They route news items to visitors as soon as they are published. The rapid rate at which news is published gives rise to a selection problem, since the capacity of news portal visitors to absorb news is limited. To address this problem, news portals deploy news recommender systems that support their visitors in selecting items to read. This paper summarizes the settings and results of CLEF NewsREEL 2015. The lab challenged participants to compete in either a "living lab" (Task 1) or an evaluation that replayed recorded streams (Task 2). The goal was to create an algorithm able to suggest news items that users would click, respecting a strict time constraint.</p>
      </abstract>
      <kwd-group>
        <kwd>news recommendation</kwd>
        <kwd>recommender systems</kwd>
        <kwd>evaluation</kwd>
        <kwd>living lab</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        News recommendation continues to draw the attention of researchers. Last year's
edition of CLEF NewsREEL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced the Open Recommendation Platform
(ORP) operated by plista. ORP provides an interface to researchers interested
in news recommendation algorithms. They can easily plug in their algorithms
and receive requests from various news publishers. Subsequently, the system
records recipients' reactions. This feedback allows participants to improve their
algorithms. In contrast to traditional offline evaluation, this "living lab"
approach reflects the application setting of an actual news recommender system.
Participants must satisfy technical requirements, and also face technical
challenges. These include response time restrictions, handling peaks in the rate of
requests, and handling continuously changing collections of users and items.
Conceptually, the evaluation represents a fair competition. All participants have the
same chance to receive a request since ORP distributes requests randomly. Random
distribution helps avoid selection bias.
      </p>
      <p>In addition to providing a fair comparison, the NewsREEL challenge would
like to level the playing field for all participants. Specifically, the environments in
which participants operate their recommendation algorithms vary widely. First,
participants' servers communicate with ORP across varying distances.
ORP is located in Berlin, Germany. Participants from America, East Asia, or
Australia face additional network latency compared to participants from
Central Europe. Their performance might suffer from failing to serve some requests
solely due to latency. Second, participants use different hardware and software to
run their algorithms. Suppose one participant has access to a high-performance
cluster, while another runs their algorithm on a rather old stand-alone
machine. Is it fair to compare the performance of these participants? The latter
participant may have developed a sophisticated algorithm yet not perform well in
the competition because they cannot meet the response time requirements.</p>
      <p>This year's edition of CLEF NewsREEL seeks to add another level of
comparison to news recommendation. Our aim is to measure systems fairly
with respect to non-functional requirements, and also to allow all participants to
take part in the challenge on an equal footing. We continue to offer the "living lab"
evaluation with ORP as Task 1. In addition, we introduce an offline evaluation
targeted at measuring additional aspects, including complexity and
scalability. In Task 2, we provide a large data set comprising interactions
between users and various news portals over a two-month time span. Participants
are able to re-run the timestamped events to determine how well their
system scales. We introduce Idomaar, a framework designed to measure technical
parameters along with recommendation quality. Idomaar instantiates virtual
machines. Since these machines share their configuration, we obtain comparable
results that do not depend on the actual system. We kept the interfaces
similar to ORP's such that participants could re-use their algorithms with only
a minor adaptation effort.</p>
      <p>The remainder of this lab overview paper is structured as follows. In Section 2,
we introduce the two subtasks of NewsREEL'15. The results of the evaluation
are presented in Section 3. Section 4 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Lab Setup</title>
      <p>
        CLEF NewsREEL'15 consisted of two subtasks. Task 1 was a repetition of the
online evaluation task ("Task 2") of NewsREEL'14. In Section 2.1, we briefly
introduce the recommendation use case of this task. For a more detailed overview,
the reader is referred to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Section 2.2 introduces the second subtask, which
focuses on simulating constant data streams, hence allowing the evaluation of real-time
recommenders using an offline data set. For a more detailed overview of this use
case, we refer to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>Task 1: Benchmark News Recommendations in a Living Lab</title>
        <p>
          This task implements the idea of evaluation in a living lab. As such,
participants were given the chance to directly interact with a real-time recommender
system. After registering with the Open Recommendation Platform (ORP) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
provided by plista GmbH, participants received recommendation requests from
various websites offering news articles. Requests were triggered by users visiting
those websites.
        </p>
        <p>
          The task followed the idea of providing evaluation as a service [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
Participants had access to a virtual machine where they could install their algorithm.
The recommender system forwarded the incoming requests to a random virtual
machine which produced the recommendation to be delivered to the requester.
The random choice was uniformly distributed over all participants. Alternatively,
participants could set up their own server to respond to incoming requests.
        </p>
        <p>Since a fixed response time limit was set, participants experienced
restrictions typical of real-world recommender systems. Such restrictions pose
requirements regarding scalability and computational complexity for the
recommendation algorithms.</p>
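        <p>To make the interaction concrete, the following minimal sketch shows how a participant endpoint might look in Python. It is an illustration only, not part of the lab infrastructure: the JSON field names and the single-endpoint layout are assumptions, and the authoritative message format is defined by ORP [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
        <preformat>
# Hypothetical participant endpoint. Assumption: ORP POSTs JSON messages
# (event notifications and recommendation requests) and expects a fast
# JSON answer; all field names here are illustrative, not ORP's schema.
from collections import deque
from flask import Flask, request, jsonify

app = Flask(__name__)
recent_items = deque(maxlen=100)  # most recently observed item ids

@app.route("/orp", methods=["POST"])
def handle():
    msg = request.get_json(force=True)
    if msg.get("type") == "event_notification":
        item = msg.get("item_id")
        if item is not None:
            recent_items.append(item)  # remember fresh items
        return "", 200
    # Recommendation request: answer within the response time limit,
    # since requests served too late count against the participant.
    recs = list(reversed(recent_items))[:6]
    return jsonify({"recs": recs})

if __name__ == "__main__":
    app.run(port=8080, threaded=True)  # serve request peaks concurrently
</preformat>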
        <p>
          ORP monitored the performance of all participants for the duration of the
challenge by measuring the recommenders' click-through rate (CTR). CTR
represents the ratio of clicks to requests. Participants had the chance to continuously
update their parameter settings in order to improve their performance levels.
Results were published on a regular basis to allow participants to compare their
performance with respect to baseline and competing approaches. An overview
of the results is given by Kille et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and also in this paper in Section 3.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Task 2: Benchmarking News Recommendations in a Simulated</title>
      </sec>
      <sec id="sec-2-3">
        <title>Environment</title>
        <p>
          For the second task, we employed the benchmarking framework Idomaar (http://rf.crowdrec.eu/), which
makes it possible to simulate data streams by "replaying" a recorded stream. The
framework is being developed in the CrowdRec project (http://crowdrec.eu/). It makes it possible to
execute and test the proposed news recommendation algorithms independently
of the execution framework and the language used for development.
Participants in this task had to predict users' clicks on recommended news articles in
simulated real time. The proposed algorithms were evaluated against both
functional (i.e., recommendation quality) and non-functional (i.e., response time)
metrics. The data set used for this task consists of news updates from diverse
news publishers, user interactions, and clicks on recommendations. An overview
of the features of the data set is provided by Kille et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
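        <p>The core of such a replay-based evaluation can be sketched as follows. This is a simplified illustration, not Idomaar itself: events are assumed to be dictionaries with ts, type, user, and item fields, and the response time budget is an assumed parameter. The harness feeds timestamped events to the recommender in order, measuring both recommendation quality and response time.</p>
        <preformat>
import time

def replay(events, recommender, time_limit=0.1):
    """Replay a recorded stream against a recommender (simplified sketch).

    Events are assumed to carry 'ts', 'type', 'user', and 'item' fields;
    'time_limit' is an assumed response-time budget in seconds.
    """
    served = clicked = errors = 0
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["type"] == "impression":
            recommender.impression(ev["user"], ev["item"])
        elif ev["type"] == "request":
            start = time.perf_counter()
            recs = recommender.recommend(ev["user"])
            elapsed = time.perf_counter() - start
            if elapsed > time_limit or not recs:
                errors += 1  # non-functional failure: too slow or empty
                continue
            served += 1
            # The recorded click is used as ground truth: the request
            # counts as a success if the clicked item was recommended.
            if ev["item"] in recs:
                clicked += 1
    return {"ctr": clicked / max(served, 1), "errors": errors}
</preformat>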
        <sec id="sec-2-3-1">
          <title>7 http://rf.crowdrec.eu/ 8 http://crowdrec.eu/</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>In this section, we detail the results of CLEF NewsREEL 2015. We start by giving
some general statistics about participation. Then, we discuss the results
for both tasks.</p>
      <sec id="sec-3-1">
        <title>Participation</title>
        <p>Forty-two teams registered for CLEF NewsREEL 2015. Of these, 38 teams
expressed interest in both tasks. A single participant registered for Task 2 only.
Three teams wanted to focus on Task 1. Participating teams were distributed across
the world, covering all continents except Australia. ORP's operators, plista,
provided five virtual machines to participants who were located far from Berlin,
Germany. Without these machines, participants would have faced the issues with
network latency already discussed above.</p>
        <p>
          Nine teams actively competed in Task 1. The competition's schedule
consisted of three evaluation time frames: 17–23 March, 7–13 April, and 5 May
to 2 June 2015. Seven of the nine teams competed in all three periods. Team
"irit-imt" stopped competing after the second period. Team "university of essex"
entered the competition as the final period started. Each team could operate
several recommendation services. Each recommendation service obtained a similar
volume of requests if active for similar times. We received a submission
describing the idea and results of team "cwi" [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Baselines</title>
          <p>
            Within the evaluation, we sought to obtain comparable results. Baselines allow
us to determine how well a participant performs relative to a very basic approach.
In last year's edition of NewsREEL, we established the baseline discussed in [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
This baseline allocates an array of fixed length for item references. As we observe
visitors interacting with the news portal, we put item references into the array.
When we receive a recommendation request, we iterate the array in reverse, returning
the first item references that are unknown to the target user. In this way, the
baseline considers both freshness and popularity. We operated the baseline on
two machines, "riemannzeta" and "gaussiannoise", which represented two
different levels of machine power. The team "riemannzeta" administered a virtual
machine with a dual-core Intel Xeon X7560 @ 2.27 GHz, 2 GB of RAM, and an 8 GB
hard drive. The team "gaussiannoise" operated a more powerful virtual machine
with a quad-core Intel Xeon X7550 @ 2.0 GHz, 8 GB of RAM, and a 26 GB hard
drive. We released the baseline approach in the form of a tutorial. Participants could
take advantage of the baseline. Additionally, we sought to establish
comparability with respect to last year's winner. Last year's winning approach is
documented in [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The approach competed as "abc" and, in a slightly adjusted
version, as "artificial intelligence", also described in [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
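          <p>The buffer baseline lends itself to a compact sketch. The following Python snippet is an illustrative reconstruction from the description above, not the released tutorial code; the buffer capacity and the recommendation list length are assumed parameters.</p>
          <preformat>
from collections import deque

class BufferBaseline:
    """Fixed-length buffer of item references: recent entries are fresh,
    and items that keep reappearing stay in the buffer (popularity)."""

    def __init__(self, capacity=100):
        self.items = deque(maxlen=capacity)  # oldest references drop out
        self.seen = {}                       # user id -> items already visited

    def impression(self, user, item):
        """Record a visitor interacting with a news item."""
        self.items.append(item)
        self.seen.setdefault(user, set()).add(item)

    def recommend(self, user, limit=6):
        """Iterate the buffer in reverse and return the first item
        references unknown to the target user."""
        known = self.seen.get(user, set())
        recs, dedup = [], set()
        for item in reversed(self.items):
            if item not in known and item not in dedup:
                recs.append(item)
                dedup.add(item)
            if len(recs) == limit:
                break
        return recs
</preformat>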
      </sec>
      <sec id="sec-3-3">
        <title>Results</title>
          <p>Task 1. We observed nine teams actively participating throughout CLEF
NewsREEL 2015. We recorded the performance of participants during three periods:
17–23 March, 7–13 April, and 5 May to 2 June 2015. The former two periods
span a week each; the latter amounts to four weeks of data. The schedule had
intentional gaps between the periods, allowing participants to improve their
algorithms. Table 1 summarizes the performances at team level. Each team has
the number of requests (R), the number of clicks (C), and their proportion (C/R)
assigned for each of the three periods. Fields with 'n/a' refer to lack of
participation. The highest average CTR per time slot is typeset in bold face. We
observe that these values increased as the competition progressed. This indicates
that teams managed to improve their recommendation algorithms over time. In
addition, this could signal that teams learned to adjust their systems better to
the challenge's requirements. Team "irit-imt" received 44 clicks for 5597 requests,
leading to the highest CTR (0.79%) in the time slot of 17–23 March. Team
"abc" received 56 clicks for 6483 requests, obtaining a CTR of 0.86% and surpassing
all competitors in the time slot of 7–13 April. Team "artificial intelligence"
collected 302 clicks for 23,756 requests, resulting in a CTR of 1.27% in the final
four-week time slot.</p>
          <p>Each participant could simultaneously operate several recommendation
engines. Some participants took advantage of this offer. Consequently, those teams
accumulated considerably more requests than others. Figure 1 illustrates the
performance of individual algorithms. We present performance on a plane defined
by the number of clicks and requests. A point on this plane refers to a specific
CTR. Points' colors refer to the respective team. The teams "cwi" and
"riadi-gdl" deployed several algorithms. Two lines depict two CTR levels. A solid
line marks the 1.0% level. A dashed line represents the 0.5% level. The
illustration confirms that teams "abc" and "artificial intelligence" outperformed
their competitors.</p>
          <p>[Figure 1: clicks (y-axis, 0–400) versus requests (x-axis, 0–3×10<sup>4</sup>) per algorithm, colored by team (abc, artificial intelligence, cwi, gaussiannoise, insight-centre, riadi-gdl, riemannzeta, university of essex); reference lines mark CTR = 1.0% and CTR = 0.5%.]</p>
          <p>We investigate how individual algorithms perform over time. Figure 2
displays 16 algorithms' CTR relative to the average CTR over the final evaluation
period's 28 days. Areas below 0 indicate a CTR lower than the average CTR
of that day. Areas above 0 represent days with above-average CTR. First, we
observe that only a subset of algorithms ran throughout the period. Algorithms
A, C, E, and K operated only sporadically. Algorithms F ("artificial intelligence")
and J ("abc") managed to perform above the average CTR on almost all days.
The majority of algorithms' CTR fluctuates around the system's average CTR.
This confirms the difficulty inherent to news recommendation. The choice of an
algorithm may depend on factors which are subject to change.</p>
          <p>The competition featured a variety of news publishers. Some provide
general as well as regional news. Other news portals specialize in topics such
as sports or information technology. Figure 3 relates 16 competing algorithms
to four major publishers. Publishers "418" (www.ksta.de) and "1677" (www.tagesspiegel.de)
provide general and regional news. Publisher "35774" (www.sport1.de)
targets sport-related news stories. Publisher "694" (www.gulli.com)
presents information technology news. Combined, they account for 85% of
recommendation requests. The heatmap illustrates higher CTR with darker shades.
CTR ranges up to 2.5% for some combinations of publishers and algorithms. We
observe that publishers "694" and "1677" have lower CTR for almost all
algorithms compared to "418" and "35774". This might be partially due to how the
publishers present the recommendations. Some presentations might draw more
attention toward the suggested articles than others. The top-performing
algorithms "andreas" (team "abc") and "Recommender" (team "artificial
intelligence") achieve the highest CTR independent of the publisher.</p>
          <p>We expect a recommendation service's reliability to affect its overall
performance. Failing to serve many requests will negatively affect CTR. Successfully
suggesting news items will harness valuable feedback to further improve the
recommendation algorithm. Figure 4 contrasts CTR and error rates observed during
the final evaluation period. CTR refers to the ratio of clicked suggestions to
received requests. Error rates reflect the proportion of requests that could not be
served by the algorithm. Performances are colored with respect to the team
operating the recommendation service. Most teams managed to keep error rates
below 10%, with the exceptions of "riemannzeta", "riadi-gdl", and "university of
essex". Remarkably, team "riadi-gdl" achieved a CTR of 0.9% at an error rate
of 53%. This indicates that their algorithm frequently failed to provide
suggestions. Simultaneously, the suggestions given were particularly relevant to the
recipients. Conversely, team "insight-centre" achieved a rather low error rate of
5.4%. Still, their CTR did not exceed 0.2%. Thereby, we conclude that while
reliability can affect CTR, we have to consider additional factors. We note the
difference in computing power between the baselines "riemannzeta" and
"gaussiannoise" described in Section 3.2. The more powerful "gaussiannoise" achieved
an error rate close to 0. In contrast, "riemannzeta" failed to respond to 16%
of its requests.</p>
          <p>[Figure: algorithm labels include recencyRandom, recency2, recency, geoRecHistory, geoRec, beta 2.0, beta 1.0, andreas, algorithms2, RingingBuff, Riadi_Recommender_Cloud_FM, Riadi_Rec_FM_W_04, Recommender, and DRB.]</p>
          <p>Task 2. The offline evaluation (based on a dataset recorded in July and August
2015) enables the reproducible evaluation of stream-based recommender
algorithms. Having complete knowledge of the data set allows us to implement
new baseline strategies. In addition to the baseline recommender used in Task 1,
we implemented an "optimal" recommender. This recommender searches the
data set for the items that the evaluation component will reward for the current
request. The strategy uses knowledge about the future. Thus, the strategy
is not a recommender algorithm; it only implements a data set look-up.
Consequently, this strategy cannot be used in the online "live" evaluation. Nevertheless,
the measured CTR of the optimal recommender is interesting, since
the strategy allows us to measure the upper bound for the CTR in the analyzed
setting.</p>
          <p>[Figure 4: CTR (y-axis, 0–0.014) versus error rate (x-axis, 0–0.2) per recommendation service, colored by team (abc, artificial intelligence, cwi, gaussiannoise, insight-centre, riadi-gdl, riemannzeta, university of essex).]</p>
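          <p>Under the same simplified event representation as in the replay sketch above, the "optimal" strategy reduces to a look-up into the future portion of the recorded stream. This is an illustrative reconstruction; the field names and list length are assumptions.</p>
          <preformat>
def optimal_recommendations(events, request_index, limit=6):
    """Oracle baseline: return the items the evaluation component will
    reward, found by looking ahead in the recorded stream. Not a real
    recommender -- usable only in offline replay, never live."""
    request = events[request_index]
    recs, seen = [], set()
    for ev in events[request_index + 1:]:  # knowledge of the future
        if ev["type"] == "click" and ev["user"] == request["user"]:
            if ev["item"] not in seen:
                seen.add(ev["item"])
                recs.append(ev["item"])
            if len(recs) == limit:
                break
    return recs
</preformat>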
          <p>Figure 5 shows the maximal achievable CTR for the three different domains
in the offline dataset. The graphs show that the CTR varies considerably from day to
day. In addition, the graphs show that the average offline CTR is specific to each
of the analyzed news portals. This can be explained
by the different user groups and the differences in the number of messages per
day. Due to the definition of the offline CTR, the expected CTR correlates with
the number of messages forwarded as requests to a participant.</p>
          <p>
            The evaluation with respect to scalability focused on maximizing
throughput. Since the teams in the competition used different hardware configurations,
the measured results cannot be compared directly. A common optimization
objective addressed by the teams working on Task 2 was the effective
synchronization of concurrently executed threads. This can be achieved by using
highly optimized data structures (such as concurrent collections or Guava, https://github.com/google/guava) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]
or by using frameworks for building asynchronous, distributable systems [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
Distributing a recommender algorithm over several machines adds extra
overhead but gives a high degree of flexibility.
          </p>
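          <p>The teams themselves worked with Java's concurrent collections, Guava, and Akka; purely as a language-neutral illustration of the synchronization problem, the following Python sketch guards a shared most-popular counter with a lock so that concurrently running request handlers never interleave updates.</p>
          <preformat>
import threading
from collections import Counter

class ConcurrentPopularity:
    """Click counts shared by worker threads; a lock serializes updates
    and snapshots so concurrent handlers see consistent state."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def record_click(self, item):
        with self._lock:
            self._counts[item] += 1

    def most_popular(self, limit=6):
        with self._lock:  # take a consistent snapshot under the lock
            return [item for item, _ in self._counts.most_common(limit)]
</preformat>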
          <p>For next year, we plan to use standardized virtual machines for the
scalability evaluation, ensuring that all teams run their algorithms on exactly the same
"virtual" hardware. In order to hide the complexity of building the evaluation
environment, we plan to improve the Idomaar framework
(https://github.com/crowdrec/idomaar) and facilitate getting started with it.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Submissions</title>
        <p>
          We received two submissions detailing the efforts of two teams. Gebremeskel
and de Vries [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] explored the utility of geographic information. They hypothesize
that visitors have a special interest in news stories about their local community.
They implement a recommender which leverages geographic data when matching
visitors and news articles. We refer to their results as team "cwi".
        </p>
        <p>
          Verbitskiy, Probst, and Lommatzsch [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] developed a most-popular
recommender. Their investigation targets scalability. They use the Akka framework
to benefit from concurrent message passing. They conducted their evaluation
outside the final evaluation period. Still, they managed to obtain a higher CTR
than the continuously running baselines.
        </p>
        <sec id="sec-3-4-1">
          <title>9 https://github.com/google/guava 10 https://github.com/crowdrec/idomaar</title>
          <p>NewsREEL aims to discover strategies that filter relevant news articles. Last
year's edition introduced a "living lab" setting. This allows participants to
evaluate their algorithms with actual users' feedback. This year's edition extended
the previous setting. We developed the Idomaar framework. It not only keeps
track of recommendation quality but also records other performance metrics.</p>
          <p>We continued competing with our baseline and last year's winning approach
in order to demonstrate the ability of approaches to improve over both a basic
system and the state of the art. Task 1 provided results which confirmed
last year's findings. The baseline proved hard to beat. Last year's winner
reclaimed the title. What produced this success story? Which factors determine
the superior recommendation quality of the "artificial intelligence" approach?</p>
          <p>A team might have an advantage if it receives a larger or smaller volume of
requests than its competitors. We observed a comparable volume of requests
for all algorithms active for the full evaluation period. These algorithms
collected on average 1000 requests per day. The few exceptions with fewer requests
were exactly those teams exhibiting higher error rates. Table 1 shows requests at
team level. Teams running several algorithms simultaneously have more requests
in total. Nevertheless, individual algorithms obtained similar shares of requests,
considering error rates and periods of inactivity. Has "artificial intelligence"
received disproportionately many requests from visitors disproportionately likely to
click? In that case, we would expect to observe varying performances on different
days and on different publishers. In other words, we assume only marginal
chances of receiving a specific subset of visitors consistently across time and
publishers. On the contrary, Figure 2 shows consistently above-average performance on
almost all days. Similarly, Figure 3 lacks evidence for variations with respect to
publishers. Is "artificial intelligence" running more reliably than its competitors?
In fact, Figure 4 shows extremely low error rates. On the other hand, competitors
including "gaussiannoise" and "cwi" achieve similar error rates but fall behind
with respect to CTR. We conclude that combining popularity, freshness, and
trend-awareness gives "artificial intelligence" a competitive advantage. Neither
chance, bias, nor reliability explains the superior performance over four weeks.</p>
          <p>We observed team "riadi-gdl" achieving the third-best performance for an
individual algorithm. This algorithm suffered from high error rates. We lack
knowledge of the approach, as we have not received a working note for this
performance. Still, it appears to involve promising algorithms which we would like
to see more of in the future. Had it compensated for the errors, the approach could
potentially have achieved an even higher CTR than "artificial intelligence".</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>CLEF NewsREEL 2015 has been an interesting challenge motivating teams to
develop and benchmark recommender algorithms online and offline. In addition
to the online evaluation focused on maximizing the CTR, the offline task
(Task 2) also considered technical issues (scalability, throughput). This year, the
participating teams tested several different approaches for recommending news,
ranging from location-based approaches to most-popular algorithms optimized
for streams to ensemble recommenders for streams. Analyzing the results, we
found that the provided baseline is hard to beat. Further, CTR varied with
respect to the publisher, indicating additional factors that affect performance.
We observed higher CTR levels compared to last year's edition. This indicates
that teams continue to optimize their algorithms.</p>
      <p>The technical challenges were addressed by applying optimized data
structures that support simultaneous access by concurrently running
threads. One team focused on machines with multiple cores; another team
implemented an approach enabling distribution over different machines
(using the Akka framework).</p>
      <p>
        Finally, we detected issues with the challenge and derived ways to further
improve participants' experience. Users struggled to get started. We had
provided tutorials for both tasks, but participants appeared to require additional
support. The Idomaar framework was updated during the competition.
On the one hand, this was necessary to fix technical issues. On the other hand,
it required participants to adjust and monitor their systems to a larger degree.
Besides improving participants' support, we seek to increase the interchange
between both tasks. Participants who evaluate their news recommenders with ORP
should take advantage of the recorded data to better tune their algorithms.
Conversely, participants working with the recorded data should check their
algorithms' performance with ORP. Thereby, they ensure that their algorithms not
only scale well but also provide relevant suggestions. Said et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] strongly advocate
such multi-objective evaluation.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>The work leading to these results has received funding (or partial funding) from
the Central Innovation Programme for SMEs of the German Federal Ministry
for Economic Affairs and Energy, as well as from the European Union's
Seventh Framework Programme (FP7/2007-2013) under grant agreement number
610594.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          .
          <article-title>Shedding Light on a Living Lab: The CLEF NEWSREEL Open Recommendation Platform</article-title>
          .
          <source>In Proceedings of the Information Interaction in Context conference, IIiX'14</source>
          , pages
          <fpage>223</fpage>
          –
          <lpage>226</lpage>
          . Springer-Verlag,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>G.</given-names>
            <surname>Gebremeskel</surname>
          </string-name>
          and
          <string-name>
            <surname>A. P. de Vries</surname>
          </string-name>
          .
          <article-title>The degree of randomness in a live recommender systems evaluation</article-title>
          .
          <source>In Working Notes for CLEF 2015 Conference</source>
          , Toulouse, France. CEUR,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kalpathy-Cramer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Eggel</surname>
          </string-name>
          .
          <article-title>Report of the evaluation-as-a-service (EaaS) expert workshop</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>49</volume>
          (
          <issue>1</issue>
          ):
          <fpage>57</fpage>
          –
          <lpage>65</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Plumbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heintz</surname>
          </string-name>
          .
          <article-title>Benchmarking news recommendations in a living lab</article-title>
          .
          <source>In 5th International Conference of the CLEF Initiative</source>
          , pages
          <fpage>250</fpage>
          –
          <lpage>267</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heintz</surname>
          </string-name>
          .
          <article-title>The plista dataset</article-title>
          .
          <source>In NRS'13: Proceedings of the International Workshop and Challenge on News Recommender Systems</source>
          , pages
          <fpage>14</fpage>
          –
          <lpage>21</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sereny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Seiler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          .
          <article-title>Stream-based recommendations: Online and offline evaluation as a service</article-title>
          .
          <source>In Proceedings of the 6th International Conference of the CLEF Association, CLEF'15</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Albayrak</surname>
          </string-name>
          .
          <article-title>Real-time recommendations for user-item streams</article-title>
          .
          <source>In Proc. of the 30th Symposium On Applied Computing, SAC '15</source>
          , pages
          <fpage>1039</fpage>
          –
          <lpage>1046</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Werner</surname>
          </string-name>
          .
          <article-title>Optimizing and evaluating stream-based news recommendation algorithms</article-title>
          .
          <source>In Proceedings of the Sixth International Conference of the CLEF Association, CLEF'15, LNCS</source>
          , vol.
          <volume>9283</volume>
          , Heidelberg, Germany,
          <year>2015</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tikk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stumpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <article-title>Recommender systems evaluation: A 3D benchmark</article-title>
          .
          <source>In Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE (RUE</source>
          <year>2012</year>
          ),
          <source>RUE'12</source>
          , pages
          <fpage>21</fpage>
          –
          <lpage>23</lpage>
          . CEUR-WS Vol.
          <volume>910</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>I.</given-names>
            <surname>Verbitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Probst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          .
          <article-title>Development and evaluation of a highly scalable news recommender system</article-title>
          .
          <source>In Working Notes for CLEF 2015 Conference</source>
          , Toulouse, France. CEUR,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>