Overview of CLEF NEWSREEL 2014: News Recommendation Evaluation Labs

Benjamin Kille (1), Torben Brodt (2), Tobias Heintz (2), Frank Hopfgartner (1), Andreas Lommatzsch (1), and Jonas Seiler (2)

(1) DAI-Labor, Technische Universität Berlin, Ernst-Reuter-Platz 7, D-10587 Berlin, Germany
{hopfgartner,kille,lommatzsch}@dai-labor.de
(2) plista GmbH, Torstr. 33-35, D-10119 Berlin, Germany
{tb,thz,jse}@plista.com

Abstract. This paper summarises the objectives, organisation, and results of the first news recommendation evaluation lab (NEWSREEL 2014). NEWSREEL targeted the evaluation of news recommendation algorithms in the form of a campaign-style evaluation lab. Participants could choose between two evaluation schemes. On the one hand, they could apply their algorithms to a data set; we refer to this setting as off-line evaluation. On the other hand, they could deploy their algorithms on a server to interactively receive recommendation requests; we refer to this setting as on-line evaluation. The latter setting is meant to reveal the actual performance of recommendation methods. The competition strived to illustrate the differences between evaluation with historical data and evaluation with actual users. The on-line evaluation reflects the requirements which active recommender systems face in practice, including real-time responses and large-scale data volumes. We present the competition's results and discuss commonalities among participants' approaches.

Keywords: recommender systems, news, on-line evaluation, living lab

1 Introduction

The spectrum of available news grows continuously as publishers keep producing news items. At the same time, we observe publishers shifting from predominantly print media towards on-line news outlets. These on-line news portals confront users with a choice between numerous news items, inducing information overload. Readers struggle to detect relevant news items in the continuous flow of information. Therefore, operators of news portals have established systems to support their readers [2]. The support includes personalisation, navigation, context-awareness, and news aggregation. CLEF NEWSREEL focuses on support through (personalised) content selection in the form of news recommendations. We assume that users benefit as news portals adapt to current trends, news relevancy, and individual tastes. News recommendation partially covers enhanced navigation as well as context-awareness. Recommended news items serve as a means to quickly navigate to relevant contents. Thus, users avoid returning to the home page to continue consuming news. In addition, news recommender systems may take advantage of contextual factors such as time, locality, and trends.

Within NEWSREEL, participants were asked to devise recommendation algorithms suggesting news items for a variety of news portals. These news portals cover several domains including general news, sports, and information technology. All news portals provide predominantly German news articles. Consequently, approximately 4 out of 5 visitors' browsers carry location identifiers pointing to Germany, Austria, or Switzerland. The goal of the lab was to let participants determine which of these factors play an important role when recommending news items.

The remainder of this paper is organised as follows. Section 2 describes the two tasks and their evaluation methodology. Section 3 summarises the results of the lab and discusses difficulties reported by participants.
Section 4 concludes the paper and gives an outlook on how we intend to continue evaluating news recommendation algorithms.

2 Lab Setup

CLEF NEWSREEL consisted of two tasks. For Task 1 we provided a data set containing recorded interactions with news portals. We refer to Task 1 as off-line evaluation. In addition, participants could deploy their recommendation algorithms in a living lab for Task 2. We refer to Task 2 as on-line evaluation and to the living lab platform as the Open Recommendation Platform (ORP, http://orp.plista.com). ORP is operated by plista (http://plista.com), a company that provides content distribution as well as targeted advertising services for a variety of websites. We dedicate a section to each task describing its goal and evaluation methodology. The reader is referred to [10] for a detailed overview of the setup.

2.1 Task 1: Off-line Evaluation

Task 1 mirrors the paradigm of previously held recommendation challenges such as the Netflix Prize (cf. [1]). As part of that challenge, Netflix released a collection of movie ratings. Participants had to predict ratings for unknown (user, item)-pairs in a hold-out evaluation set. Analogously, we split a collection of interactions with news items into training and test partitions. The initial data set has been described in [11].

Netflix could split their data randomly. This is due to the underlying assumption that movie preferences remain constant over time. In other words, users will continue to (dis-)like movies they once (dis-)liked. In contrast, we refrain from assuming that users will enjoy reading news articles they once read. Rather, we suppose that news relevancy decreases relatively quickly. Thus, we did not randomly select individual interactions for evaluation. Instead, we randomly selected time frames which we removed completely from the data set. We avoided a moving time-window approach, as this would have meant releasing the entire data collection. We considered 3 parameters for the randomised sampling:

– Portal specificity
– Interval width
– Interval frequency

Portal specificity refers to the choice between using identical time intervals for all news portals and having portal-specific intervals. The former alternative lets us treat all portals in the same way. The latter alternative, on the other hand, provides a setting where participants may utilise information from other sources – i.e., other news portals – which better reflects the situation actual news recommenders face. For instance, articles targeting a certain event may already have been published on some news portals. The other portals could use interactions with these articles to boost their own articles. We decided to sample portal-specific time slots for evaluation.

Selecting a suitable interval width is a non-trivial task. Choosing the width too small results in an insufficient amount of evaluation data. Conversely, setting the width too large entails a rather high number of articles as well as users missing from the training data. Additionally, the amount of interactions varies over the day and the week. For instance, we observe considerably fewer interactions at night than during the day. We decided not to keep the interval width fixed, expecting that varying it would remedy coincidentally bad choices. Thus, we varied the interval width within the set {30, 60, 120, 180, 240} minutes.
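To make the sampling concrete, the following is a minimal Python sketch of the portal-specific procedure. The portal identifiers and the log start date are hypothetical placeholders; the interval widths, the roughly one-month log span, and the number of slots per portal (about 15, as discussed below) follow the description in the text. The formal procedure is given as Algorithm 1 below.

```python
import random
from datetime import datetime, timedelta

# Assumed parameters for illustration only (not the official lab configuration):
PORTALS = ["portal_a", "portal_b", "portal_c"]   # hypothetical portal identifiers
WIDTHS_MINUTES = [30, 60, 120, 180, 240]         # interval widths used in the lab
SLOTS_PER_PORTAL = 15                            # roughly one evaluation slot every second day
LOG_START = datetime(2013, 6, 1)                 # hypothetical start of the one-month log
LOG_END = LOG_START + timedelta(days=30)


def sample_evaluation_slots(portals, widths, slots_per_portal, start, end, seed=42):
    """Randomly draw portal-specific evaluation time slots of varying width."""
    rng = random.Random(seed)
    log_seconds = int((end - start).total_seconds())
    slots = {}
    for portal in portals:
        portal_slots = []
        for _ in range(slots_per_portal):
            width = timedelta(minutes=rng.choice(widths))
            # draw a random start time such that the slot fits into the log span
            offset = rng.randrange(log_seconds - int(width.total_seconds()))
            slot_start = start + timedelta(seconds=offset)
            portal_slots.append((slot_start, slot_start + width))
        # merge slots that happen to overlap instead of resampling them
        portal_slots.sort()
        merged = [portal_slots[0]]
        for s, e in portal_slots[1:]:
            last_s, last_e = merged[-1]
            if s <= last_e:
                merged[-1] = (last_s, max(last_e, e))
            else:
                merged.append((s, e))
        slots[portal] = merged
    return slots


if __name__ == "__main__":
    result = sample_evaluation_slots(PORTALS, WIDTHS_MINUTES, SLOTS_PER_PORTAL,
                                     LOG_START, LOG_END)
    for portal, portal_slots in result.items():
        print(portal, len(portal_slots), "evaluation slots")
```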
We observed that recommendation algorithms struggle to provide adequate suggestions based on training data that lacks the most recent 4 hours [15]. This is mainly due to the rapidly evolving character of news. Moreover, we observe that news portals continuously publish new items which attract a majority of readers. The initial raw data covers a time span of about 1 month. We faced the decision of how many time slots to remove for evaluation. We had to avoid removing so much data that insufficient data would remain for training. At the same time, we strived to obtain meaningful results. We decided to sample approximately 15 time slots per news portal. Thus, we expected to extract evaluation data roughly every second day. We noticed that some time slots overlapped by chance. We decided to merge such overlapping time slots and refrained from resampling. Algorithm 1 outlines the sampling procedure.

Algorithm 1 Sampling Procedure
Input: set of p news portals P, set of w interval widths W, number of samples s
T ← ∅                              ▷ set to contain the sample result
function SAMPLE(P, W, s)
    for i ← 1 to p do
        for j ← 1 to s do
            T ← T ∪ random(t, w)   ▷ randomly choose a time point t and interval width w
        end for
    end for
    return T
end function

Having created data for training and testing, we still had to determine an evaluation metric. The literature on recommender systems' evaluation provides a rich set of metrics. Metrics relating to rating prediction accuracy and item ranking are among the most popular choices. Root mean squared error (RMSE) and mean absolute error (MAE) are frequently used for the former setting (cf. [8, 9]). Supporters of ranking-oriented evaluation favour metrics such as precision/recall [5], normalised discounted cumulative gain [16], or mean reciprocal rank [14]. Rating prediction as well as ranking-based evaluation require preferences with graded relevancy as input. However, users do not tend to rate news articles. Thus, we cannot apply rating prediction metrics. Neither can we apply ranking metrics, as we lack data about pair-wise preferences between news items. Our data carry the signal of users interacting with news items. Thus, we ended up defining the evaluation metric based upon the ability to correctly predict whether an interaction will occur. Let the pair (u, i) denote user u reading news item i included in the evaluation data. We challenged participants to select the 10 items each previously observed user would interact with in each evaluation time slot.

The choice of exactly 10 items to suggest may appear arbitrary. We observe a majority of users interacting with only few items. Thus, most of the suggestions are likely to be incorrect. On the other hand, limiting the number of suggestions to very few items entails drawbacks as well. Imagine a user who actually reads five articles in a time slot contained in the evaluation data. If participants suggest only 3 items, a recommender that would predict all 5 interactions correctly appears to perform on a level with a recommender predicting only 3 interactions correctly. Thus, requesting more suggestions provides more sensitivity. This sensitivity allows us to better differentiate the individual recommendation algorithms' performances. On the other hand, the included news portals do not display more than 6 recommendations at a time. Hence, requesting substantially more recommendations would induce a setting which insufficiently reflects the actual use case.
Thus, we opted for 10 suggestions, which represents a reasonable trade-off between sensitivity and reflecting the actual scenario. Note that [6] found that 10 preferences typically suffice to provide adequate recommendations. Finally, we define the evaluation metric according to Equation 1:

h = \frac{\sum_{u \in U} \sum_{j=1}^{10} I(u, i_j)}{10\,|U|}   (1)

where h refers to the hitrate. I represents the indicator function returning 1 if the predicted interaction occurred and 0 otherwise. The denominator normalises the number of hits by the maximum number possible. Thus, the hitrate falls into the interval [0, 1]. Since most users will not exhibit 10 interactions in the evaluation time slot, we expect the hitrate to be close to 0.

2.2 Task 2: On-line Evaluation

Task 2 follows an alternative paradigm compared to Task 1. Task 1 ensures comparability of results, mainly because all participants apply their algorithms to identical data. In contrast, Task 2 provides a setting where participants have to handle similar yet not identical data. plista GmbH has established a living lab where researchers and practitioners can deploy their recommendation algorithms to interact with actual users. This approach allows us to observe the actual performance of recommendation methods. This means that our findings reflect actual benefits for real users. Further, we are able to observe variations over time and conduct studies at large scale as we record more and more data. Conversely, evaluation on recorded data only expresses how a method would have performed.

The approach does entail some disadvantages as well. Participants had to deal with technical requirements including response times, scalability, and availability. Deployed systems faced numerous requests to which they had to reply within at most 100 ms. This response time restriction represents a particular challenge for participants located far from Germany, where the ORP servers are located. Network latencies might further reduce the available response time. We offered virtual machines to participants who either had no servers at their disposal or suffered from high network latency. As a result, these requirements allowed us to verify how well certain recommendation algorithms adapt to real-world settings.

We asked participants to deploy their recommendation algorithms to a server. Subsequently, they connected the server to ORP, which forwarded recommendation requests. Widgets on the individual news portals' websites displayed the suggested news items to users. ORP tracks success in terms of clicks. This opens up several ways to evaluate participants' performances. One option is to consider the number of clicks. Considering the relative number of clicks per request represents another option; industry refers to this metric as click-through rate (CTR). Given comparable numbers of requests, both criteria lead to the same ranking. In situations with varying numbers of requests, evaluation becomes tricky. Considering the total number of clicks may bias the evaluation in favour of the participants receiving more requests. Conversely, considering the relative number of clicks per request may favour teams with few requests. We want to evaluate the performance of a recommendation algorithm, and ORP provides all participants with the chance to obtain similar numbers of requests. We therefore decided to use the absolute number of clicks as the decisive criterion. Nevertheless, we additionally present the relative number of clicks per request in our evaluation.
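To make the lab's two evaluation criteria concrete, the following is a small Python sketch that computes the Task 1 hitrate of Equation 1 and the Task 2 click counts and click-through rate. The input structures (dictionaries of predicted and observed items, a flat list of per-request click events) are hypothetical simplifications for illustration, not the formats used by the released data set or by ORP.

```python
from typing import Dict, List, Set, Tuple


def hitrate(predictions: Dict[str, List[str]],
            observed: Dict[str, Set[str]]) -> float:
    """Task 1 metric (Equation 1): fraction of the 10 suggested items per user
    that the user actually interacted with in the evaluation time slot."""
    users = predictions.keys()
    hits = sum(1
               for user in users
               for item in predictions[user][:10]   # at most 10 suggestions count
               if item in observed.get(user, set()))
    return hits / (10 * len(users)) if users else 0.0


def clicks_and_ctr(events: List[Tuple[str, bool]]) -> Dict[str, Tuple[int, int, float]]:
    """Task 2 criteria: absolute clicks, requests, and their ratio (CTR) per algorithm.
    Each event is (algorithm, clicked?) for one recommendation request."""
    stats: Dict[str, List[int]] = {}
    for algorithm, clicked in events:
        requests_clicks = stats.setdefault(algorithm, [0, 0])
        requests_clicks[0] += 1
        requests_clicks[1] += int(clicked)
    return {alg: (clicks, requests, clicks / requests)
            for alg, (requests, clicks) in stats.items()}


# toy usage with made-up identifiers
if __name__ == "__main__":
    print(hitrate({"u1": ["i1", "i2", "i3"]}, {"u1": {"i2", "i7"}}))          # 0.1
    print(clicks_and_ctr([("baseline", True), ("baseline", False), ("AL", True)]))
```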
Baselines support comparing the relative performance of algorithms. We deployed a baseline which is detailed in [13]. The baseline combines two important factors for news recommendation: popularity and recency. We consider a fixed number of the most recently occurring interactions. The baseline recommends news items included in this list that the current user has not previously seen. Consequently, we obtain a computationally efficient method that inherently considers popularity and recency.

We realised that participants needed to tune their algorithms. For this reason, we explicitly defined 3 evaluation periods during which performances would be logged. Participants could improve their algorithms before as well as in between these periods. We set the 3 periods to 7–23 February, 1–14 April, and 27–31 May.

3 Evaluation

In this section, we detail the results of CLEF NEWSREEL 2014. We start by giving some statistics about participation in general. Then, we discuss the results for both tasks. Note that we unfortunately did not receive any submissions for Task 1. We provide some considerations about possible reasons for this.

3.1 Participation

51 participants registered for Task 1 and 52 participants registered for Task 2. However, no participant submitted a solution for Task 1. We observed 13 active participants for Task 2. Note that participants had the chance to contribute several solutions for Task 2. Four participants submitted a working notes paper to the CLEF proceedings [3].

3.2 Evaluation of Task 1

We did not receive any submissions for Task 1. Thus, we cannot report any results on how well future interactions could be predicted. We can think of several reasons which may have prevented participants from submitting results. First, the data set exhibits a large volume of more than 60 GB, which participants would have had to process. Participants' available computational resources may not have allowed them to iteratively optimise their recommendation algorithms on this amount of data. Second, we imagine that participants might have preferred Task 2 over Task 1. This preference may be due to the interactive character as well as the rather unique chance to evaluate algorithms with actual users' feedback. We admit that there are already plenty of data-set-driven competitions; for instance, the on-line platform www.kaggle.com offers a variety of data sets. Finally, the restriction to German news articles might have deterred participants who intended to evaluate content-based approaches but do not speak German.

3.3 Evaluation of Task 2

Throughout the pre-defined evaluation periods, we observed 13 active participants on ORP. Unfortunately, the component recording the performance failed twice. Thus, we did not receive data for the periods 7–12 February and 27–31 May. Not all teams were active in all periods. This illustrates the technical requirements which participants faced. ORP automatically disables the communication with participants in case their servers do not respond in time. ORP then tries to re-establish the communication. We noticed that re-establishing the communication did not succeed on all occasions. We allowed participants to simultaneously deploy several algorithms. Some participants used this option more extensively than others did. Table 1 shows the results for the evaluation periods 7–23 February, 1–14 April, and 27–31 May. We list the number of clicks, the number of requests, and their ratio for each algorithm which was active during the respective period. We note that the number of requests varies between algorithms.
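As a point of reference for the following discussion of the results, here is a minimal sketch of a popularity-and-recency baseline in the spirit of the one described in Section 2.2. It is a simplified illustration under assumptions (an in-memory buffer of recent reads, at most 6 suggestions per request), not the deployed implementation described in [13].

```python
from collections import deque


class RecentPopularBaseline:
    """Illustrative sketch of a popularity-and-recency baseline: keep a fixed-size
    buffer of the most recently read items and recommend those the current user
    has not seen yet, most frequent in the buffer first."""

    def __init__(self, buffer_size: int = 1000):
        self.recent = deque(maxlen=buffer_size)   # most recent interactions (item ids)
        self.seen = {}                            # user id -> set of item ids already read

    def observe(self, user_id: str, item_id: str) -> None:
        """Record that user_id read item_id."""
        self.recent.append(item_id)
        self.seen.setdefault(user_id, set()).add(item_id)

    def recommend(self, user_id: str, k: int = 6):
        """Return up to k unseen items ordered by frequency in the recent buffer
        (the portals display at most 6 recommendations at a time)."""
        counts = {}
        for item_id in self.recent:
            counts[item_id] = counts.get(item_id, 0) + 1
        ranked = sorted(counts, key=counts.get, reverse=True)
        unseen = (item for item in ranked if item not in self.seen.get(user_id, set()))
        return [item for _, item in zip(range(k), unseen)]


# toy usage with made-up users and items
if __name__ == "__main__":
    baseline = RecentPopularBaseline(buffer_size=5)
    for user, item in [("u1", "a"), ("u2", "b"), ("u3", "a"), ("u1", "c")]:
        baseline.observe(user, item)
    print(baseline.recommend("u2"))   # items u2 has not read yet, e.g. ['a', 'c']
```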
Algorithm AL gathered the most clicks in periods 2 and 3 as well as the second most clicks in the first period. Note that the baseline consistently appears among the five best-performing algorithms. This indicates that popularity and recency represent two important factors when recommending news. Additionally, the baseline has a low computational complexity such that it is able to reply to a large fraction of requests. Table 2 aggregates the results per participant. The aggregated results confirm our impressions from the algorithm level.

In addition to the overall figures, we investigate whether particular algorithms perform exceptionally well in specific contexts. Context refers either to specific news portals or to times of day. News portals offer varying contents. For instance, www.sport1.de is dedicated to sports-related news while www.gulli.com provides news on information technology. Thus, we look at the performance of individual algorithms with respect to specific publishers. Likewise, we investigate algorithms' performances throughout the day. We suppose that different types of users consume news at varying hours of the day. For instance, users reading news early in the morning may have other interests than users reading late in the evening. This matches the findings of [18]. Figure 1 shows a heatmap relating algorithms to publisher and hour of day. We note that few algorithms perform on comparable levels for all publishers and throughout the day. This indicates that combining several recommendation algorithms in an ensemble has the potential to yield better performance.

We do not know the details of all recommendation algorithms. The participants who described their approaches in working notes pursued different ideas. Most systems carried a fall-back solution in terms of most-popular and/or most-recent strategies. Additionally, participants contributed more sophisticated algorithms, including association rules [12], content-based recommenders [4], and ensembles of different recommendation strategies [7]. Reportedly, the more sophisticated methods had trouble dealing with the high volume of requests. In particular, the peak hours around lunchtime were hard to handle. Our baseline method combines the notions of most-popular and most-recent recommendation. The evaluation shows that the baseline is hard to beat. This may be due to the technical restrictions rather than the recommendation quality: more sophisticated methods which just miss the response time limit may well provide better recommendations.

Fig. 1: Algorithms' click-through rate grouped by time of day as well as by publisher.

Table 1: Results of Task 2 grouped by the evaluation periods. We list the number of clicks, the number of requests, and their ratio for all participating algorithms. The highest number of clicks per period is marked with an asterisk (*).
Algorithm             | 7–23 Feb                  | 1–14 Apr                  | 27–31 May
                      | Clicks   Requests  CTR    | Clicks   Requests  CTR    | Clicks   Requests  CTR
AL                    | 6,426    436,094   0.01   | 17,220*  801,078   0.02   | 2,519*   127,928   0.02
andreas               | 8,649*   581,243   0.01   | 16,004   767,039   0.02   | 422      30,519    0.01
baseline              | 2,642    256,406   0.01   | 5,616    418,556   0.01   | 663      54,912    0.01
HP                    | -        -         -      | 56       8,501     0.01   | -        -         -
inbeat-C              | 1,855    193,114   0.01   | 120      13,987    0.01   | -        -         -
inbeat-R              | 2,118    228,930   0.01   | 141      16,747    0.01   | 1,192    106,303   0.01
inbeat-THDMLT         | 1,883    187,466   0.01   | 97       14,373    0.01   | 3        618       0.00
inbeat-TI             | 3,139    251,529   0.01   | 1,222    88,540    0.01   | -        -         -
Insight Click         | -        -         -      | -        -         -      | 153      17,751    0.01
Insight Default       | -        -         -      | -        -         -      | 122      42,903    0.00
Insight ICF           | -        -         -      | -        -         -      | 29       2,949     0.01
Insight MLTAE         | -        -         -      | -        -         -      | 128      14,833    0.01
Insight MLTK          | -        -         -      | -        -         -      | 177      21,620    0.01
Insight MLTOE         | -        -         -      | -        -         -      | 107      11,711    0.01
Insight MLTOEC        | -        -         -      | -        -         -      | 86       11,526    0.01
Insight MLTTLD        | -        -         -      | -        -         -      | 170      19,380    0.01
Insight MLTTLHR       | -        -         -      | -        -         -      | 181      20,185    0.01
Insight MLTTT         | -        -         -      | -        -         -      | 96       12,455    0.01
Insight MP            | -        -         -      | -        -         -      | 126      16,481    0.01
Insight MPC           | -        -         -      | -        -         -      | 119      11,933    0.01
Insight MPHR          | -        -         -      | -        -         -      | 135      13,754    0.01
Insight MPLD          | -        -         -      | -        -         -      | 171      19,394    0.01
Insight MPUGL         | -        -         -      | -        -         -      | 115      11,425    0.01
Insight MPWD          | -        -         -      | -        -         -      | 130      16,553    0.01
Insight MRA           | -        -         -      | -        -         -      | 78       14,297    0.01
Insight Vote          | -        -         -      | -        -         -      | 179      17,632    0.01
LatestPopular         | -        -         -      | 52       8,736     0.01   | -        -         -
LRwRR                 | -        -         -      | 1,333    89,428    0.01   | -        -         -
Max Testing 2         | 1,932    106,151   0.02   | 6,972    337,551   0.02   | 151      10,192    0.01
mhTest                | 521      119,067   0.00   | 3,914    220,303   0.02   | 1,102    100,604   0.01
MostPopular           | -        -         -      | 2        8,652     0.00   | -        -         -
razor                 | 0        57        0.00   | 1        81        0.01   | -        -         -
Recommender 1         | -        -         -      | 18       2,816     0.01   | -        -         -
Recommender 2         | -        -         -      | 606      35,758    0.02   | -        -         -
server1               | 3,592    335,861   0.01   | -        -         -      | -        -         -
TELCONTAR ORP         | 0        370,616   0.00   | 0        874,759   0.00   | -        -         -
test - recent         | -        -         -      | 122      8,231     0.01   | -        -         -
UNED-Lattice-Reco-1   | -        -         -      | -        -         -      | 61       29,543    0.00
UNED-Lattice-Reco-2   | -        -         -      | -        -         -      | 34       2,384     0.01
UNED-MostNovelItems   | -        -         -      | -        -         -      | 370      69,607    0.01
UNED-TopScoredItems   | -        -         -      | -        -         -      | 142      51,489    0.00
UP                    | -        -         -      | 41       8,687     0.00   | -        -         -

Table 2: Aggregated results for the participants by clicks, requests, and their ratio. The second column references publications detailing the algorithms, where available.

Team                     | Reference | Clicks   | Requests   | CTR
labor                    | [13]      | 26,165   | 1,365,100  | 0.02
abc                      | [13]      | 25,075   | 1,378,801  | 0.02
inbeat                   | [12]      | 11,770   | 1,101,607  | 0.01
plista GmbH              | -         | 9,055    | 1,765,379  | 0.01
baseline                 | [13]      | 8,921    | 729,874    | 0.01
ba2014                   | [17]      | 5,537    | 439,974    | 0.01
student                  | -         | 3,592    | 335,861    | 0.01
insight                  | [7]       | 2,425    | 305,094    | 0.01
recommenders.net         | -         | 1,333    | 89,428     | 0.01
artificial intelligence  | -         | 624      | 38,574     | 0.02
uned                     | [4]       | 607      | 153,023    | 0.00
i2r                      | -         | 151      | 34,576     | 0.00
TELCONTAR                | -         | 0        | 1,245,375  | 0.00

4 Conclusion

CLEF NEWSREEL gave participants the opportunity to evaluate their recommendation algorithms in two different fashions. Task 1 offered a rich data set recorded over a one-month period on 12 news portals. We removed time slots for evaluation purposes. Participants were asked to predict which articles users would read during these held-out times. We received no contribution for this task. Task 2 enabled participants to evaluate their recommendation algorithms by interacting with actual users. Participants could deploy their algorithms on a server which subsequently received recommendation requests. This setting closely mirrors the circumstances under which actual recommender systems operate. Participants struggled with the high volume of requests and the narrow response time limits. We observed that most-popular and most-recent approaches are hard to beat due to their low complexity.
Acknowledgement

The work leading to these results has received funding (or partial funding) from the Central Innovation Programme for SMEs of the German Federal Ministry for Economic Affairs and Energy, as well as from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610594.

References

1. J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of the KDD Cup and Workshop, pages 3–6, 2007.
2. D. Billsus and M. J. Pazzani. Adaptive News Access. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web, chapter 18, pages 550–570. Springer, 2007.
3. L. Cappellato, N. Ferro, M. Halvey, and W. Kraaij, editors. CLEF 2014 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, 2014.
4. A. Castellanos, A. Garcia-Serrano, and J. Cigarran. UNED @ CLEF-NEWSREEL 2014. In CLEF 2014 Labs and Workshops, Notebook Papers, 2014.
5. P. Cremonesi. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the 2010 ACM Conference on Recommender Systems, pages 39–46, 2010.
6. P. Cremonesi, P. Milano, and R. Turrin. User effort vs. accuracy in rating-based elicitation. In 6th ACM Conference on Recommender Systems, pages 27–34, 2012.
7. D. Doychev, A. Lawlor, and R. Rafter. An analysis of recommender algorithms for online news. In CLEF 2014 Labs and Workshops, Notebook Papers, 2014.
8. A. Gunawardana. A survey of accuracy evaluation metrics of recommendation tasks. Journal of Machine Learning Research, 10:2935–2962, 2009.
9. J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53, 2004.
10. F. Hopfgartner, B. Kille, A. Lommatzsch, T. Plumbaum, T. Brodt, and T. Heintz. Benchmarking news recommendations in a living lab. In CLEF'14: Proceedings of the Fifth International Conference of the CLEF Initiative. Springer Verlag, 2014. To appear.
11. B. Kille, F. Hopfgartner, T. Brodt, and T. Heintz. The plista dataset. In Proceedings of the International News Recommender Systems Workshop and Challenge, 2013.
12. J. Kuchar and T. Kliegr. InBeat: Recommender system as a service. In CLEF 2014 Labs and Workshops, Notebook Papers, 2014.
13. A. Lommatzsch. Real-time news recommendation using context-aware ensembles. In Advances in Information Retrieval, pages 51–62. Springer, 2014.
14. Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. Oliver, and A. Hanjalic. CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering. In RecSys, pages 139–146, 2012.
15. M. Tavakolifard, J. A. Gulla, K. C. Almeroth, F. Hopfgartner, B. Kille, T. Plumbaum, A. Lommatzsch, T. Brodt, A. Bucko, and T. Heintz. Workshop and challenge on news recommender systems. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 481–482, 2013.
16. S. Vargas and P. Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys '11), page 109, 2011.
17. S. Werner and A. Lommatzsch. Optimizing and evaluating stream-based news recommendation algorithms. In CLEF 2014 Labs and Workshops, Notebook Papers, 2014.
18. J. Yuan, S. Marx, F. Sivrikaya, and F. Hopfgartner. When to recommend what? A study on the role of contextual factors in IP-based TV services. In MindTheGap'14: Proceedings of the MindTheGap'14 Workshop, pages 12–18. CEUR, 2014.