          Overview of CLEF NewsREEL 2015:
         News Recommendation Evaluation Lab

    Benjamin Kille1, Andreas Lommatzsch1, Roberto Turrin2, András Serény3,
     Martha Larson4, Torben Brodt5, Jonas Seiler5, and Frank Hopfgartner6

                  1 TU Berlin, Berlin, Germany
                    {benjamin.kille,andreas.lommatzsch}@dai-labor.de
                  2 ContentWise R&D - Moviri, Milan, Italy
                    roberto.turrin@moviri.com
                  3 Gravity R&D, Budapest, Hungary
                    sereny.andras@gravityrd.com
                  4 TU Delft, Delft, The Netherlands
                    m.a.larson@tudelft.nl
                  5 Plista GmbH, Berlin, Germany
                    {torben.brodt,jonas.seiler}@plista.com
                  6 University of Glasgow, Glasgow, UK
                    frank.hopfgartner@glasgow.ac.uk



        Abstract. News readers struggle as they face ever-increasing numbers of
        articles. Digital news portals are becoming more and more popular. They
        route news items to visitors as soon as they are published. The rapid
        rate at which news is published gives rise to a selection problem,
        since the capacity of news portal visitors to absorb news is limited. To
        address this problem, news portals deploy news recommender systems
        in order to support their visitors in selecting items to read. This paper
        summarizes the settings and results of CLEF NewsREEL 2015. The lab
        challenged participants to compete in either a “living lab” (Task 1) or
        an evaluation that replayed recorded streams (Task 2). The goal was
        to create an algorithm able to suggest news items that users would
        click, while respecting a strict time constraint.

        Keywords: news recommendation, recommender systems, evaluation,
        living lab


1     Introduction
News recommendation continues to draw the attention of researchers. Last year’s
edition of CLEF NewsREEL [4] introduced the Open Recommendation Platform
(ORP) operated by plista. ORP provides an interface to researchers interested
in news recommendation algorithms. They can easily plug in their algorithms
and receive requests from various news publishers. Subsequently, the system
records recipients’ reactions. This feedback allows participants to improve their
algorithms. In contrast to traditional offline evaluation, this “living lab” ap-
proach reflects the application setting of an actual news recommender system.
Participants must satisfy technical requirements, and also face technical chal-
lenges. These include response time restrictions, handling peaks in the rate of
requests, and handling continuously changing collections of users and items. Con-
ceptually, the evaluation represents a fair competition. All participants have the
same chance to receive a request since ORP distributes them randomly. Random
distribution helps avoid selection bias.
    In addition to providing fair comparison, the NewsREEL challenge would
like to level the playing field for all participants. Specifically, the environments in
which participants operate their recommendation algorithms vary widely. First,
participants’ servers have to communicate with ORP over varying distances.
ORP is located in Berlin, Germany. Participants from America, East Asia, or
Australia face additional network latency compared to participants from Central
Europe. Their performance might suffer because some requests fail to be served
solely due to latency. Second, participants use different hardware and software to
run their algorithms. Suppose one participant has access to a high-performance
cluster while another runs their algorithm on a rather old stand-alone machine.
Is it fair to compare the performance of these participants? The latter participant
may have developed a sophisticated algorithm that does not perform well in
the competition simply because it cannot meet the response time requirements.
    This year’s edition of CLEF NewsREEL seeks to add another level of com-
parison to news recommendation. Our aim is to be able to fairly measure systems
with respect to non-functional requirements, and also allow all participants to
take part in the challenge on equal footing. We continue to offer the “living lab”
evaluation with ORP as Task 1. In addition, we introduce an offline evaluation
targeted at measuring additional aspects. These aspects include complexity and
scalability. In Task 2, we provide a large data set comprising interactions be-
tween users and various news portals in a two-month time span. Participants
are able to re-run the timestamped events to determine how well their sys-
tem scales. We introduce Idomaar, a framework designed to measure technical
parameters along with recommendation quality. Idomaar instantiates virtual
machines. Since these machines share the same configuration, we obtain comparable
results that do not depend on participants’ own hardware. We kept the interfaces
similar to ORP’s so that participants could re-use their algorithms with only
minor adaptation effort.
    The remainder of this lab overview paper is structured as follows. In Section 2,
we introduce the two subtasks of NewsREEL’15. The results of the evaluation
are presented in Section 3. Section 4 concludes the paper.


2    Lab Setup

CLEF NewsREEL’15 consisted of two subtasks. Task 1 was a repetition of the
online evaluation task (“Task 2”) of NewsREEL’14. In Section 2.1, we briefly in-
troduce the recommendation use case of this task. For a more detailed overview,
the reader is referred to [4]. Section 2.2 introduces the second subtask that fo-
cuses on simulating constant data streams, hence allowing evaluation of real-time
recommenders using an offline data set. For a more detailed overview of this use
case, we refer to [6].


2.1    Task 1: Benchmark News Recommendations in a Living Lab

This task implements the idea of evaluation in a living lab. As such, partici-
pants were given the chance to directly interact with a real-time recommender
system. After registering with the Open Recommendation Platform (ORP) [1]
provided by plista GmbH, participants received recommendation requests from
various websites offering news articles. Requests were triggered by users visiting
those websites.
    The task followed the idea of providing evaluation as a service [3]. Partici-
pants had access to a virtual machine where they could install their algorithm.
The recommender system forwarded the incoming requests to a random virtual
machine which produced the recommendation to be delivered to the requester.
The random choice was uniformly distributed over all participants. Alternatively,
participants could set up their own server to respond to incoming requests.
    Since a fixed response time limit was set, the participants experienced
restrictions typical of real-world recommender systems. Such restrictions impose
requirements on the scalability and computational complexity of the
recommendation algorithms.
    ORP monitored the performance of all participants over the course of the
challenge by measuring the recommenders’ click-through rate (CTR). CTR
represents the ratio of clicks to requests. Participants had the chance to continuously
update their parameter settings in order to improve their performance.
Results were published on a regular basis to allow participants to compare their
performance with respect to baseline and competing approaches. An overview
of the results is given by Kille et al. [6], and also in this paper in Section 3.
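    For reference, the click-through rate can be written as a simple ratio. As a worked
example, team “abc” collected 349 clicks on 30 649 requests in the final evaluation period
(see Table 1 in Section 3), which yields a CTR of roughly 1.14 %:

    \mathrm{CTR} = \frac{\#\,\mathrm{clicks}}{\#\,\mathrm{requests}},
    \qquad\text{e.g.}\qquad
    \mathrm{CTR} = \frac{349}{30\,649} \approx 1.14\,\%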


2.2    Task 2: Benchmarking News Recommendations in a Simulated
       Environment

For the second task, we employed the benchmarking framework Idomaar7, which
makes it possible to simulate data streams by “replaying” a recorded stream (a
minimal sketch of this replay idea is given at the end of this subsection). The
framework is being developed in the CrowdRec project8. It allows executing and
testing the proposed news recommendation algorithms independently of the
execution framework and the programming language used for development.
Participants of this task had to predict users’ clicks on recommended news articles
in simulated real time. The proposed algorithms were evaluated against both
functional (i.e., recommendation quality) and non-functional (i.e., response time)
metrics. The data set used for this task consists of news updates from diverse
news publishers, user interactions, and clicks on recommendations. An overview
of the features of the data set is provided by Kille et al. [5].
7
    http://rf.crowdrec.eu/
8
    http://crowdrec.eu/
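    To make the replay idea concrete, the following Java sketch shows how a recorded,
timestamp-ordered event stream can be re-issued to a recommender while response times are
measured. It is an illustration only, not the Idomaar implementation; the Event fields, the
Recommender interface, and the time-limit handling are simplifying assumptions.

    import java.util.List;

    /**
     * Minimal sketch of the replay idea behind Task 2: recorded events are re-issued
     * in timestamp order, and the recommender's answers and response times are measured.
     * Not the Idomaar implementation; all names are illustrative.
     */
    public class StreamReplayer {

        /** A recorded event: an item update, an impression/click, or a recommendation request. */
        record Event(long timestampMs, String type, String userId, String itemId) {}

        /** Hypothetical interface; participants plug their algorithm in here. */
        interface Recommender {
            void observe(Event event);              // item updates, impressions, clicks
            List<String> recommend(Event request);  // must answer within the time limit
        }

        public void replay(List<Event> recordedStream, Recommender recommender, long timeLimitMs) {
            int answeredInTime = 0;
            int lateOrEmpty = 0;
            for (Event event : recordedStream) {    // events are assumed to be sorted by timestamp
                if ("request".equals(event.type())) {
                    long start = System.nanoTime();
                    List<String> suggestions = recommender.recommend(event);
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    if (elapsedMs <= timeLimitMs && !suggestions.isEmpty()) {
                        answeredInTime++;           // quality is judged against recorded clicks
                    } else {
                        lateOrEmpty++;              // counts against the non-functional metric
                    }
                } else {
                    recommender.observe(event);     // feed context to the algorithm
                }
            }
            System.out.printf("answered in time: %d, late or empty: %d%n", answeredInTime, lateOrEmpty);
        }
    }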
3     Evaluation
In this section, we detail results of CLEF NewsREEL 2015. We start by giving
some statistics about the participation in general. Then, we discuss the results
for both tasks.

3.1   Participation
Forty-two teams registered for CLEF NewsREEL 2015. Of these, 38 teams
expressed interest in both tasks. A single participant registered for Task 2 only.
Three teams wanted to focus on Task 1. Participating teams were distributed
across the world, covering all continents except Australia. ORP’s operators,
plista, provided five virtual machines to participants located far from Berlin,
Germany. Without these machines, those participants would have faced the
network latency issues already discussed above.

    Nine teams actively competed in Task 1. The competition’s schedule con-
sisted of three evaluation time frames: 17–23 March, 7–13 April, and 5 May
to 2 June 2015. Seven out of nine teams competed in all three periods. Team
“irit-imt” stopped competing after the second period. Team “university of essex”
entered the competition as the final period started. Each team could operate sev-
eral recommendation services. Each recommendation service obtained a similar
volume of requests if active for similar times. We received a submission describ-
ing the idea and results of team “cwi” [2].



3.2   Baselines
Within the evaluation, we sought to obtain comparable results. Baselines allow
us to determine how well a participant performs relative to a very basic approach.
In last year’s edition of NewsREEL, we established the baseline discussed in [4].
This baseline allocates an array of fixed length for item references. As we observe
visitors interacting with the news portal, we add item references to the array.
When we receive a recommendation request, we iterate the array in reverse order
and return the first item references that are unknown to the target user. In this
way, the baseline considers both freshness and popularity; a sketch of the approach
is given at the end of this section. We operated the baseline on two machines,
“riemannzeta” and “gaussiannoise”, which represented two different levels of
machine power. The team “riemannzeta” administered a virtual machine with a
dual-core Intel Xeon X7560 @ 2.27 GHz, 2 GB of RAM, and an 8 GB hard drive.
The team “gaussiannoise” operated a more powerful virtual machine with a
quad-core Intel Xeon X7550 @ 2.0 GHz, 8 GB of RAM, and a 26 GB hard drive.
We released the baseline approach in the form of a tutorial, so participants could
take advantage of it. Additionally, we sought to establish comparability with
respect to last year’s winner. Last year’s winning approach is documented in [7];
it competed as “abc” and, in a slightly adjusted version, as “artificial intelligence”.
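    The following Java sketch illustrates the baseline idea described above: a fixed-capacity
buffer of recently seen item references, iterated from the newest entry while skipping items
the target user already knows. It is not the released tutorial code; class and method names
are illustrative.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.Set;

    /** Minimal sketch of the baseline: recent items, newest first, unknown to the user. */
    public class RecentItemsBaseline {

        private final int capacity;
        private final Deque<String> recentItems = new ArrayDeque<>();

        public RecentItemsBaseline(int capacity) {
            this.capacity = capacity;
        }

        /** Called for every observed interaction: remember the item, dropping the oldest one. */
        public void observeImpression(String itemId) {
            recentItems.remove(itemId);             // keep at most one reference per item
            recentItems.addFirst(itemId);
            if (recentItems.size() > capacity) {
                recentItems.removeLast();
            }
        }

        /** Called for every request: return fresh, recently popular items unknown to the user. */
        public List<String> recommend(Set<String> itemsKnownToUser, int limit) {
            List<String> result = new ArrayList<>();
            for (String itemId : recentItems) {     // newest first: freshness and popularity combined
                if (!itemsKnownToUser.contains(itemId)) {
                    result.add(itemId);
                    if (result.size() == limit) {
                        break;
                    }
                }
            }
            return result;
        }
    }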
3.3   Results

Task 1 We observed nine teams actively participating throughout CLEF News-
REEL 2015. We recorded the performance of participants during three periods:
17–23 March, 7–13 April, and 5 May – 2 June 2015. The former two periods
span a week each; the latter amounts to four weeks of data. The schedule had
intentional gaps between the periods, allowing participants to improve their
algorithms. Table 1 summarizes the performances at the team level. For each
team, the number of requests (R), number of clicks (C), and their proportion
(C/R) are given for each of the three periods. Fields with ‘n/a’ indicate that the
team did not participate in that period. The highest average CTR per time slot
is typeset in bold face. We observe that these values increased as the competition
progressed. This indicates that teams managed to improve their recommendation
algorithms over time. In addition, it could signal that teams learned to adjust
their systems better to the challenge’s requirements. Team “irit-imt” received
44 clicks on 5597 requests, leading to the highest CTR (0.79 %) in the time slot
17–23 March. Team “abc” received 56 clicks on 6483 requests, obtaining a CTR
of 0.86 % and surpassing all competitors in the time slot 7–13 April. Team
“artificial intelligence” collected 302 clicks on 23 756 requests, resulting in a CTR
of 1.27 % in the final four-week time slot.



Table 1. We present the results of nine participating teams. Each participant could
operate several algorithms simultaneously. Results are aggregated over all algorithms.
The evaluation includes three periods. We report the number of clicks (C), requests
(R), and their relation (C/R). We highlight the highest CTR for each interval by bold
typeface.

                            17–23 March         7–13 April         5 May – 2 June
Team                    C      R C/R        C      R C/R           C     R C/R
abc                      73 9740    0.75% 56 6483      0.86%      349 30649    1.14%
artificial intelligence 71 10234    0.69% 49 6479      0.76%      302 23756    1.27%
cwi                     161 24644   0.65% 130 22767    0.57%     1082 149544   0.72%
gaussiannoise            55 10063   0.55% 44 6515      0.68%      249 31343    0.79%
insight-centre           26 6833    0.38% 27 6500      0.42%       48 28857    0.17%
irit-imt                 44 5597    0.79% 63 9481      0.66%      n/a    n/a   n/a
riadi-gdl                 0    26   0.00% 45 6303      0.71%      177 27412    0.65%
riemannzeta              50 7684    0.65% 23 3833      0.60%      185 22064    0.84%
university of essex     n/a n/a     n/a   n/a n/a      n/a         17 5562     0.31%



   Each participant could simultaneously operate several recommendation en-
gines. Some participants took advantage of this offer. Consequently, those teams
accumulated considerably more requests than others. Figure 1 illustrates the per-
formance of individual algorithms. We present performance on a plane defined
by the number of clicks and requests. A point on this plane refers to a specific
CTR. Each point’s color indicates the respective team. The teams “cwi” and
“riadi-gdl” deployed several algorithms. Two lines depict two CTR levels: a solid
line marks the 1.0 % level, and a dashed line represents the 0.5 % level. The
illustration confirms that teams “abc” and “artificial intelligence” outperformed
their competitors.


[Figure 1: scatter plot of clicks versus requests for individual algorithms, colored by team
(abc, artificial intelligence, cwi, gaussiannoise, insight-centre, riadi-gdl, riemannzeta,
university of essex); reference lines mark CTR = 1.0 % and CTR = 0.5 %.]

Fig. 1. Teams were eligible to run several algorithms simultaneously. We observe some
teams operating various recommenders. Teams “abc” and “artificial intelligence” man-
aged to achieve a CTR of more than 1 %.


    We investigate how individual algorithms perform over time. Figure 2 dis-
plays 16 algorithms’ CTR relative to the average CTR over the final evaluation
period’s 28 days. Areas below 0 indicate a CTR lower than the average CTR
of that day. Areas above 0 represent days with above-average CTR. First, we
observe that only a subset of algorithms ran throughout the period. Algorithms
A, C, E, and K operated only sporadically. Algorithms F (“artificial intelligence”)
and J (“abc”) managed to perform above the average CTR on almost all days.
The majority of algorithms’ CTRs fluctuate around the system’s average CTR.
This confirms the difficulty inherent to news recommendation. The choice of an
algorithm may depend on factors which are subject to change.
    The competition featured a variety of news publishers. Some provide general
as well as regional news. Other news portals specialize in topics such as sports
or information technology. Figure 3 relates 16 competing algorithms with four
major publishers. Publishers “418” (www.ksta.de) and “1677” (www.tagesspiegel.de)
provide general and regional news. Publisher “35774” (www.sport1.de) targets
sport-related news stories. Publisher “694” (www.gulli.com) presents information
technology news. Combined, they account for ≈ 85 % of recommendation requests.
The heatmap illustrates higher CTR with darker shades. CTR ranges up to 2.5 %
for some combinations of publishers and algorithms.
[Figure 2: per-algorithm panels A–P showing daily CTR relative to the daily average CTR
over the final evaluation period.]

Fig. 2. We compare the performance of 16 algorithms over the final evaluation pe-
riod’s span of 29 days. We compute the average CTR for each day. Subsequently, we
subtract the result from each algorithm’s individual CTR for the same day. The labels
map to teams as follows: A–E → “riadi-gdl”, F → “artificial intelligence”, G → “gaus-
siannoise”, H → “riemannzeta”, I → “insight-centre”, J → “abc”, K → “university of
essex”, and L–P → “cwi”.



We observe that publishers “694” and “1677” have a lower CTR for almost all
algorithms compared to “418” and “35774”. This might partially be due to how
the publishers present the recommendations: some presentations might draw more
attention toward the suggested articles than others. The top-performing algorithms
“andreas” (team “abc”) and “Recommender” (team “artificial intelligence”) achieve
the highest relative CTR independent of the publisher.
    We expect a recommendation service’s reliability to affect its overall performance.
Failing to serve many requests negatively affects CTR. Successfully suggesting news
items yields valuable feedback that can further improve the recommendation algorithm.
Figure 4 contrasts CTR and error rates observed during the final evaluation period.
CTR refers to the ratio of clicked suggestions to received requests. Error rates reflect
the proportion of requests that could not be served by the algorithm. Performances
are colored with respect to the team operating the recommendation service. Most
teams managed to keep error rates below 10 %, with the exceptions of “riemannzeta”,
“riadi-gdl”, and “university of essex”. Remarkably, team “riadi-gdl” achieved a CTR
of ≈ 0.9 % at an error rate of ≈ 53 %. This indicates that their algorithm frequently
failed to provide suggestions; simultaneously, the suggestions it did give were
particularly relevant to the recipients.
[Figure 3: heatmap of CTR (color scale 0.000–0.025) per algorithm and publisher; algorithms
include recencyRandom, recency2, recency, geoRecHistory, geoRec, beta 2.0, beta 1.0, andreas,
algorithms2, RingingBuff, Riadi_Recommender_Cloud_FM, Riadi_Rec_FM_W_04, Recommender,
and DRB; publishers are 418, 694, 1677, and 35774.]



Fig. 3. The heatmap shows the Click-Through-Rates observed for combinations of al-
gorithms and publishers. The four publishers account for ≈ 85 % of requests. Publishers
“418” and “35774” obtain a higher CTR compared to “694” and “1677” on average.



Conversely, team “insight-centre” achieved a rather low error rate of ≈ 5.4 %.
Still, their CTR did not exceed 0.2 %. We therefore conclude that while reliability
can affect CTR, additional factors have to be considered. We note the difference
in computing power between the baselines “riemannzeta” and “gaussiannoise”
described in Section 3.2. The more powerful “gaussiannoise” achieved an error
rate close to 0. In contrast, “riemannzeta” failed to respond to ≈ 16 % of its
requests.

Task 2 The offline evaluation (based on a data set recorded in July and August
2015) enables the reproducible evaluation of stream-based recommender algorithms.
Having complete knowledge of the data set allows us to implement new baseline
strategies. In addition to the baseline recommender used in Task 1, we implemented
an “optimal” recommender. This recommender looks up in the data set the items
that the evaluation component will reward for the current request. The strategy
uses knowledge about the future; thus, it is not a recommender algorithm but
merely a data set look-up. Consequently, it cannot be used in the online “live”
evaluation.
[Figure 4: scatter plot of CTR (0–1.4 %) versus error rate (0–0.6) for individual algorithms,
colored by team.]


Fig. 4. The figure illustrates the relation between error rates and CTR as observed in
the final evaluation period. Algorithms are colored according to their team membership.
CTR refers to the ratio of clicks to requests. Error rates represent the proportion of
requests which could not be served by the algorithm.



Nevertheless, the measured CTR of the optimal recommender is interesting, since
it provides an upper bound for the CTR achievable in the analyzed setting.
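    A minimal sketch of this look-up strategy, assuming a precomputed index from request
identifiers to the items that are clicked later in the recorded stream, could look as follows.
The index and the interface are illustrative assumptions, not part of the actual evaluation
component.

    import java.util.List;
    import java.util.Map;

    /**
     * Sketch of the "optimal" oracle used as an upper bound in Task 2. It is not a
     * recommender algorithm: it simply looks up the items the evaluator will reward.
     */
    public class OracleRecommender {

        /** For every request id, the items that are clicked later in the recorded stream. */
        private final Map<String, List<String>> futureClicks;

        public OracleRecommender(Map<String, List<String>> futureClicks) {
            this.futureClicks = futureClicks;
        }

        /** Returns exactly the items that will be rewarded; impossible in a live setting. */
        public List<String> recommend(String requestId) {
            return futureClicks.getOrDefault(requestId, List.of());
        }
    }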
    Figure 5 shows the maximal achievable CTR for the three different domains
in the offline data set. The graphs show that the CTR varies strongly from day
to day. In addition, the average offline CTR is specific to each of the analyzed
news portals. This can be explained by the different user groups and the
differences in the number of messages per day. Due to the definition of the
offline CTR, the expected CTR correlates with the number of messages forwarded
as requests to a participant.
    The evaluation with respect to scalability focused on maximizing throughput.
Since the teams in the competition used different hardware configurations, the
measured results cannot be compared directly. A common optimization objective
addressed by the teams working on Task 2 is the effective synchronization of
concurrently executed threads.
[Figure 5: daily offline Click-Through-Rate (0–3.5 %) of the optimal recommender for
domainIDs 596, 694, and 1677.]
Fig. 5. The figure visualizes the offline CTR for the “optimal” recommender algorithm.
The optimal recommendation strategy is implemented by looking up the items that
will be rewarded by the evaluator. The strategy defines the upper bound of the CTR
reachable in Task 2.


This can be achieved by using highly optimized data structures (such as concurrent
collections or Guava9) [8], or by using frameworks for building asynchronous,
distributable systems [10]. Distributing a recommender algorithm over several
machines adds extra overhead but provides a high degree of flexibility.
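    As an illustration of the first option, the following sketch uses plain java.util.concurrent
collections to maintain popularity counts that many worker threads can update without
explicit locking. It is a generic example of the technique, not the code of the systems
described in [8] or [10].

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    /** Thread-safe popularity counter based on lock-free concurrent collections. */
    public class ConcurrentPopularityCounter {

        private final ConcurrentHashMap<String, LongAdder> clickCounts = new ConcurrentHashMap<>();

        /** Safe to call from any number of threads handling incoming events. */
        public void registerClick(String itemId) {
            clickCounts.computeIfAbsent(itemId, id -> new LongAdder()).increment();
        }

        /** Snapshot of the currently most popular items, usable while updates continue. */
        public List<String> topItems(int limit) {
            List<Map.Entry<String, LongAdder>> entries = new ArrayList<>(clickCounts.entrySet());
            entries.sort(Comparator.comparingLong(
                    (Map.Entry<String, LongAdder> e) -> e.getValue().sum()).reversed());
            List<String> top = new ArrayList<>();
            for (int i = 0; i < Math.min(limit, entries.size()); i++) {
                top.add(entries.get(i).getKey());
            }
            return top;
        }
    }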
    For the next year, we plan to use standardized virtual machines for the scala-
bility evaluation, ensuring that all teams run the algorithms on exactly the same
“virtual” hardware. In order to hide the complexity of building the evaluation
environment, we plan to improve the Idomaar framework10 and facilitate getting
started with it.


3.4     Submissions

We received two submissions detailing the efforts of two teams. Gebremeskel
and de Vries [2] explored the utility of geographic information. They hypothesize
that visitors have a special interest in news stories about their local community.
They implement a recommender that leverages geographic data when matching
visitors and news articles. We refer to their results as team “cwi”.
    Verbitskiy, Probst, and Lommatzsch [10] developed a most-popular recommender.
Their investigation targets scalability. They use the Akka framework to benefit
from concurrent message passing (a minimal illustration of this idea is sketched
below). They conducted their evaluation outside the final evaluation period. Still,
they managed to obtain a higher CTR than the continuously operated baselines.
9
     https://github.com/google/guava
10
     https://github.com/crowdrec/idomaar
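    As an illustration of the message-passing idea mentioned above, the following sketch uses
the Akka classic Java API to keep click counts inside an actor; since an actor processes one
message at a time, no explicit synchronization is required. It is a generic illustration, not the
system described in [10]; the actor and message names are assumptions.

    import java.util.HashMap;
    import java.util.Map;

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    /** Actor that counts clicks per item; its state is touched by one message at a time. */
    public class ClickCounterActor extends AbstractActor {

        private final Map<String, Long> clickCounts = new HashMap<>();

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, itemId -> clickCounts.merge(itemId, 1L, Long::sum))
                    .build();
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("newsreel-sketch");
            ActorRef counter = system.actorOf(Props.create(ClickCounterActor.class), "click-counter");
            counter.tell("item-42", ActorRef.noSender());  // thread-safe, asynchronous message
            system.terminate();
        }
    }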
3.5   Discussion

NewsREEL aims to discover strategies that filter relevant news articles. Last
year’s edition introduced a “living lab” setting. This allows participants to eval-
uate their algorithms with actual users’ feedback. This year’s edition extended
the previous setting. We developed the Idomaar framework, which not only keeps
track of recommendation quality but also records other performance metrics.
    We continued competing with our baseline and last year’s winning approach
in order to demonstrate the ability of approaches to improve over both a basic
system, and also the state of the art. Task 1 provided results which confirmed
last year’s findings. The baseline proved to be hard to beat. Last year’s winner
re-claimed the title. What produced this success story? Which factors determine
the superior recommendation quality of the “artificial intelligence” approach?
    A team might have an advantage if it receives a larger or smaller volume of
requests than its competitors. We observed a comparable volume of requests
for all algorithms active for the full evaluation period. These algorithms collected
on average ≈ 1000 requests per day. The few exceptions with fewer requests
were exactly those teams exhibiting higher error rates. Table 1 shows requests at
the team level. Teams running several algorithms simultaneously received more
requests in total. Nevertheless, individual algorithms obtained similar shares of
requests once error rates and periods of inactivity are taken into account. Did
“artificial intelligence” receive disproportionately many requests from visitors
disproportionately likely to click? In that case, we would expect to observe varying
performance on different days and across different publishers. In other words, we
assume only marginal chances of receiving a specific subset of visitors consistently
across time and publishers. On the contrary, Figure 2 shows consistently
above-average performance on almost all days. Similarly, Figure 3 lacks evidence
for variations with respect to publishers. Does “artificial intelligence” run more
reliably than its competitors? Indeed, Figure 4 shows extremely low error rates.
On the other hand, competitors including “gaussiannoise” and “cwi” achieve
similar error rates but fall behind with respect to CTR. We conclude that
combining popularity, freshness, and trend-awareness gives “artificial intelligence”
a competitive advantage. Neither chance, bias, nor reliability alone explains its
superior performance over four weeks.
    We observed team “riadi-gdl” achieving the third-best performance for an
individual algorithm. This algorithm suffered from high error rates. We lack
knowledge of the approach as we did not receive a working note describing it.
Still, it appears to involve promising ideas that we would like to see more of in
the future. Had the errors been compensated, the approach could potentially
have achieved an even higher CTR than “artificial intelligence”.


4     Conclusion

CLEF NewsREEL 2015 has been an interesting challenge motivating teams to
develop and benchmark recommender algorithms online and offline. In addition
to the online evaluation focused on maximizing the CTR, the offline task
(Task 2) also considered technical issues (scalability, throughput). This year, the
participating teams tested several different approaches for recommending news,
ranging from location-based approaches to most-popular algorithms optimized
for streams to ensemble recommenders for streams. Analyzing the results, we
found that the provided baseline is hard to beat. Further, CTR varied with
respect to the publisher, indicating additional factors that affect performance.
We observed higher CTR levels compared to last year’s edition. This indicates
that teams continue to optimize their algorithms.
    The technical challenges have been addressed by applying optimized data
structures that support simultaneous access by concurrently running threads.
One team focused on machines with multiple cores; another team implemented
an approach enabling distribution over different machines (using the Akka
framework).
    Finally, we identified issues with the challenge and derived ways to further
improve participants’ experience. Participants struggled to get started. We had
provided tutorials for both tasks, but participants appeared to require additional
support. The Idomaar framework had been updated during the competition.
On the one hand, this was necessary to fix technical issues. On the other hand,
this required participants to adjust and monitor their systems to a larger degree.
Besides improving participants’ support, we seek to increase the interchange be-
tween both tasks. Participants who evaluate their news recommender with ORP
should take advantage of the recorded data to better tune their algorithms.
Conversely, participants working with the recorded data should check their al-
gorithms’ performance with ORP. Thereby, they ensure that their algorithms not
only scale well but also provide relevant suggestions. Said et al. [9] strongly advocate
such multi-objective evaluation.


Acknowledgement
The work leading to these results has received funding (or partial funding) from
the Central Innovation Programme for SMEs of the German Federal Ministry
for Economic Affairs and Energy, as well as from the European Union’s Seventh
Framework Programme (FP7/2007-2013) under grant agreement number
610594.


References
 1. T. Brodt and F. Hopfgartner. Shedding Light on a Living Lab: The CLEF NEWS-
    REEL Open Recommendation Platform. In Proceedings of the Information Inter-
    action in Context conference, IIiX’14, pages 223–226. Springer-Verlag, 2014.
 2. G. Gebremeskel and A. P. de Vries. The degree of randomness in a live rec-
    ommender systems evaluation. In Working Notes for CLEF 2015 Conference,
    Toulouse, France. CEUR, 2015.
 3. F. Hopfgartner, A. Hanbury, H. Mueller, N. Kando, S. Mercer, J. Kalpathy-Cramer,
    M. Potthast, T. Gollub, A. Krithara, J. Lin, K. Balog, and I. Eggel. Report of the
    evaluation-as-a-service (EaaS) expert workshop. SIGIR Forum, 49(1):57–65, 2015.
 4. F. Hopfgartner, B. Kille, A. Lommatzsch, T. Plumbaum, T. Brodt, and T. Heintz.
    Benchmarking news recommendations in a living lab. In 5th International Con-
    ference of the CLEF Initiative, pages 250–267, 2014.
 5. B. Kille, F. Hopfgartner, T. Brodt, and T. Heintz. The plista dataset. In NRS’13:
    Proceedings of the International Workshop and Challenge on News Recommender
    Systems, pages 14–21. ACM, October 2013.
 6. B. Kille, A. Lommatzsch, R. Turrin, A. Sereny, M. Larson, T. Brodt, J. Seiler, and
    F. Hopfgartner. Stream-based recommendations: Online and offline evaluation
    as a service. In Proceedings of the 6th International Conference of the CLEF
    Association, CLEF’15, 2015.
 7. A. Lommatzsch and S. Albayrak. Real-time recommendations for user-item
    streams. In Proceedings of the 30th Symposium on Applied Computing, SAC
    ’15, pages 1039–1046, New York, NY, USA, 2015. ACM.
 8. A. Lommatzsch and S. Werner. Optimizing and evaluating stream-based news
    recommendation algorithms. In Proceedings of the Sixth International Conference
    of the CLEF Association, CLEF’15, LNCS, vol. 9283, Heidelberg, Germany, 2015.
    Springer.
 9. A. Said, D. Tikk, K. Stumpf, Y. Shi, M. Larson, and P. Cremonesi. Recommender
    systems evaluation: A 3D benchmark. In Proceedings of the Workshop on Recom-
    mendation Utility Evaluation: Beyond RMSE (RUE 2012), RUE’12, pages 21–23.
    CEUR-WS Vol. 910, 2012.
10. I. Verbitskiy, P. Probst, and A. Lommatzsch. Development and evaluation of a highly
    scalable news recommender system. In Working Notes for CLEF 2015 Conference,
    Toulouse, France. CEUR, 2015.