<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of CLEF NEWSREEL 2014: News Recommendation Evaluation Labs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Kille</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Torben Brodt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Heintz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hopfgartner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Lommatzsch</string-name>
          <email>lommatzschg@dai-labor.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Seiler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAI-Labor, Technische Universität Berlin</institution>
          ,
          <addr-line>Ernst-Reuter-Platz 7, D-10587 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>plista GmbH</institution>
          ,
          <addr-line>Torstr. 33-35, D-10119 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>790</fpage>
      <lpage>801</lpage>
      <abstract>
        <p>This paper summarises the objectives, organisation, and results of the first news recommendation evaluation lab (NEWSREEL 2014). NEWSREEL targeted the evaluation of news recommendation algorithms in the form of a campaign-style evaluation lab. Participants had the chance to apply two types of evaluation schemes. On the one hand, participants could apply their algorithms to a data set. We refer to this setting as off-line evaluation. On the other hand, participants could deploy their algorithms on a server to interactively receive recommendation requests. We refer to this setting as on-line evaluation. This setting ought to reveal the actual performance of recommendation methods. The competition aimed to illustrate differences between evaluation with historical data and evaluation with actual users. The on-line evaluation reflects all requirements which active recommender systems face in practice. These requirements include real-time responses and large-scale data volumes. We present the competition's results and discuss commonalities among participants' approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>news</kwd>
        <kwd>on-line evaluation</kwd>
        <kwd>living lab</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The spectrum of available news grows continuously as news publishers keep producing
news items. At the same time, we observe publishers shifting from predominantly
print media towards on-line news outlets. These on-line news portals confront users with
a choice between numerous news items, which induces information overload. Readers
struggle to detect relevant news items in the continuous flow of information. Therefore,
operators of news portals have established systems to support them [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The support
includes personalisation, navigation, context-awareness, and news aggregation.
      </p>
      <p>CLEF NEWSREEL focuses on support through (personalised) content selection in
the form of news recommendations. We assume that users benefit as news portals adapt to
current trends, news’ relevancy, and individual tastes. News recommendation partially
includes enhanced navigation as well as context-awareness. Recommended news items
serve as a means to quickly navigate to relevant content. Thus, users avoid returning to
the home page to continue consuming news. In addition, news recommender systems
may take advantage of contextual factors. These factors include time, locality, and
trends.</p>
      <p>Within NEWSREEL, participants were asked to find recommendation algorithms
that suggest news items for a variety of news portals. These news portals cover several domains
including general news, sports, and information technology. All news portals provide
predominantly German news articles. Consequently, approximately 4 out of 5 visitors’
browsers carry location identifiers pointing to Germany, Austria, or Switzerland.
The goal of the lab was to let participants determine which of these factors play an
important role when recommending news items. The remainder of this paper is organised
as follows. Section 2 describes the two tasks and their evaluation methodology. Section 3
summarises the results of the lab and discusses difficulties reported by participants.
Section 4 concludes the paper and gives an outlook on how we intend to continue
evaluating news recommendation algorithms.</p>
    </sec>
    <sec id="sec-2">
      <title>Lab Setup</title>
      <p>
        CLEF NEWSREEL consisted of two tasks. For Task 1, we provided a data set containing
recorded interactions with news portals. We refer to Task 1 as off-line evaluation. In
addition, participants could deploy their recommendation algorithms in a living lab for
Task 2. We refer to Task 2 as on-line evaluation and to the living lab platform as the Open
Recommendation Platform (ORP)<sup>3</sup>. ORP is operated by plista<sup>4</sup>, a company that provides
content distribution as well as targeted advertising services for a variety of websites. We
dedicate a section to each task describing the goal and evaluation methodology. The
reader is referred to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for a detailed overview of the setup.
      </p>
      <sec id="sec-2-1">
        <title>Task 1: Off-line Evaluation</title>
        <p>
          Task 1 mirrors the paradigm of formerly held recommendation challenges such as the
Netflix Prize challenge (cf. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]). As part of the challenge, Netflix released a collection
of movie ratings. Participants had to predict ratings for unknown (user, item) pairs in a
hold-out evaluation set. Analogously, we split a collection of interactions with news
items into training and test partitions. The initial data set has been described in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
Netflix could split their data randomly. This is due to the underlying assumption that
movie preferences remain constant over time. In other words, users will continue to
(dis-)like movies they once (dis-)liked. In contrast, we do not assume that users will
enjoy re-reading news articles they once read. Rather, we suppose that news’ relevancy
decreases relatively quickly. Thus, we refrained from randomly selecting individual interactions for
evaluation. Instead, we randomly selected time frames which we completely removed
from the data set. We avoided a moving time-window approach, as this would have required
releasing the entire data collection. We considered 3 parameters for the randomised
sampling:
          <list list-type="bullet">
            <list-item><p>Portal specificity</p></list-item>
            <list-item><p>Interval width</p></list-item>
            <list-item><p>Interval frequency</p></list-item>
          </list>
          <fn id="fn3"><label>3</label><p>http://orp.plista.com</p></fn>
          <fn id="fn4"><label>4</label><p>http://plista.com</p></fn>
        </p>
        <p>
          Portal specificity refers to the choice between using identical time intervals for all
news portals and having portal-specific intervals. The former alternative lets us treat
all portals in the same way. The latter alternative provides a setting
where participants may utilise information from other sources (i.e., other news portals),
which better reflects the situation actual news recommenders face. For instance,
articles targeting a certain event may have been published on some news portals already.
The remaining portals could use interactions with these articles to boost their own
articles. We decided to sample portal-specific time slots for evaluation.
        </p>
        <p>
          Selecting a suitable
interval width represents a non-trivial task. Choosing the width too small will result in an
insufficient amount of evaluation data. Conversely, setting the width too large will entail
a rather high number of articles as well as users missing in the training data. Additionally,
the number of interactions varies over the day and the week. For instance, we observe
considerably fewer interactions at night than during the day. We decided not to keep
the interval width fixed, since we expected that varying it would remedy coincidental bad
choices. Thus, we drew the interval width from the set {30, 60, 120, 180, 240} minutes. We
observed that recommendation algorithms struggle to provide adequate suggestions
based on training data that lacks the most recent 4 hours [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. This is mainly due to the
rapidly evolving character of news. Moreover, we observe that news portals continue
to provide new items which attract a majority of readers.
        </p>
        <p>
          The initial raw data covers a
time span of about 1 month. We faced the decision on how many time slots to remove
for evaluation. We had to avoid removing so much data that too little remained
for training. At the same time, we strove to obtain meaningful results. We decided to sample
approximately 15 time slots per news portal. Thus, we expected to extract evaluation
data about every second day. We noticed that some time slots overlapped by chance. In such cases, we
merged the overlapping time slots and refrained from resampling. Algorithm 1 outlines
the sampling procedure.
        </p>
        <p>Algorithm 1 Sampling Procedure</p>
        <preformat>Input: set of p news portals P, set of w interval widths W, number of samples s
T ← ∅                                ▷ set to contain the sample result
function SAMPLE(P, W, s)
    for i ← 1 to p do
        for j ← 1 to s do
            T ← T ∪ random(t, w)     ▷ randomly choose a time point t and interval width w
        end for
    end for
    return T
end function</preformat>
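        <p>As an illustration only, the sampling procedure can be sketched in Python. The names sample_time_slots and merge_overlaps, the representation of time slots as minute offsets, the horizon of 30 days, and the seeded random generator are assumptions made for this sketch; they are not taken from the implementation used in the lab. Overlapping slots of one portal are merged rather than resampled, following the description of the sampling in this section.</p>
        <preformat>
```python
import random

WIDTHS = (30, 60, 120, 180, 240)  # interval widths in minutes, as listed in the text
MONTH_MINUTES = 30 * 24 * 60      # assumed horizon: the data covers about one month


def merge_overlaps(slots):
    """Merge overlapping or touching (start, end) intervals instead of resampling."""
    merged = []
    for start, end in sorted(slots):
        if merged and merged[-1][1] >= start:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def sample_time_slots(portals, widths=WIDTHS, samples=15,
                      horizon=MONTH_MINUTES, seed=None):
    """Draw `samples` random evaluation slots per portal (portal-specific slots).

    Returns a dict mapping each portal to a sorted list of (start, end)
    minute offsets relative to the beginning of the recording period.
    """
    rng = random.Random(seed)
    result = {}
    for portal in portals:
        slots = []
        for _ in range(samples):
            width = rng.choice(widths)              # random interval width w
            start = rng.randrange(horizon - width)  # random time point t
            slots.append((start, start + width))
        result[portal] = merge_overlaps(slots)
    return result
```
        </preformat>
        <p>Because merging can fuse two sampled slots, a portal may end up with fewer than 15 evaluation slots in this sketch, which mirrors the decision not to resample.</p>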
        <p>
          Having created data for training and testing, we still have to determine an evaluation
metric. The literature on recommender systems’ evaluation provides a rich set of metrics.
Metrics relating to rating prediction accuracy and item ranking are among the most
popular choices. Root mean squared error (RMSE) and mean absolute error
(MAE) are frequently used for the former evaluation setting (cf. [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]). Supporters
of ranking-oriented evaluation favour metrics such as precision/recall [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], normalised
discounted cumulative gain [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], or mean reciprocal rank [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Rating prediction as well
as ranking-based evaluation require preferences with graded relevancy as input. However,
users do not tend to rate news articles. Thus, we cannot apply rating prediction metrics.
Also, we cannot apply ranking metrics as we lack data about the pair-wise preference
towards news items. Our data carry the signal of users interacting with news items. Thus,
we defined the evaluation metric based upon the ability to correctly predict
whether an interaction will occur.
        </p>
        <p>
          Let the pair (u, i) denote user u reading news item i included in the evaluation data.
We challenged participants to select the 10 items each previously observed user would
interact with in each evaluation time slot. The choice of exactly 10 items to suggest may
appear arbitrary. We observe a majority of users interacting with only a few items. Thus,
most of the suggestions are likely to be incorrect. On the other hand, limiting the number of
suggestions to very few items entails drawbacks as well. Imagine a user who actually
reads five articles in a time slot contained in the evaluation data. If participants
suggested only 3 items, a recommender able to predict all 5 interactions correctly would appear
to perform on a par with a recommender able to predict only 3 interactions correctly. Thus,
requesting many suggestions provides more sensitivity. This sensitivity allows us to
better differentiate the individual recommendation algorithms’ performances. However,
the included news portals do not display more than 6 recommendations
at a time. Hence, requesting substantially more recommendations would induce a setting
which insufficiently reflects the actual use case. Thus, we opted for 10 suggestions,
which represents a reasonable trade-off between sensitivity and reflecting the actual
scenario. Note that [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] found that 10 preferences typically suffice to provide adequate
recommendations. Finally, we define the evaluation metric according to Equation 1:
          <disp-formula id="eq1"><label>(1)</label><tex-math>h = \frac{\sum_{u \in U} \sum_{j=1}^{10} I(u, i_j)}{10 \, |U|}</tex-math></disp-formula>
        </p>
        <p>where h refers to the hitrate. I represents the indicator function returning 1 if the
predicted interaction occurred and 0 otherwise. The denominator normalises the number
of hits by the maximal number possible. Thus, the hitrate falls into the interval [0, 1].
Since most users will not exhibit 10 interactions in the evaluation time slot, we expect
the hitrate to be closer to 0.</p>
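        <p>The hitrate of Equation 1 can be illustrated with a short Python sketch. The function name hit_rate and the dictionary-based input format are assumptions chosen for this example, the user and item identifiers are invented, and the sketch assumes that each user's suggestions are distinct items.</p>
        <preformat>
```python
def hit_rate(predictions, observed, k=10):
    """Compute h = (sum over users of hits among k suggestions) / (k * |U|).

    predictions: dict user -> list of suggested item ids (only the first k count)
    observed:    dict user -> set of item ids the user read in the evaluation slot
    """
    if not predictions:
        return 0.0
    hits = 0
    for user, items in predictions.items():
        seen = observed.get(user, set())
        # I(u, i_j) is 1 if the predicted interaction occurred and 0 otherwise
        hits += sum(1 for item in items[:k] if item in seen)
    return hits / (k * len(predictions))


# Invented example: u1 reads five of the ten suggested items, u2 reads none.
suggested = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
predictions = {"u1": suggested, "u2": suggested}
observed = {"u1": {"a", "b", "c", "d", "e"}, "u2": {"z"}}
print(hit_rate(predictions, observed))  # 5 hits / (10 * 2) = 0.25
```
        </preformat>
        <p>Even a recommender that predicts every interaction of u1 correctly reaches only 0.5 for that user, which illustrates why hitrates close to 0 are expected.</p>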
      </sec>
      <sec id="sec-2-2">
        <title>Task 2: On-line Evaluation</title>
        <p>Task 2 follows a different paradigm from Task 1. Task 1 ensures comparability
of results, mainly because all participants apply their algorithms
to identical data. In contrast, Task 2 provides a setting where participants have to
handle similar yet not identical data. The company plista has established a living lab
where researchers and practitioners can deploy their recommendation algorithms to
interact with actual users. This approach allows us to observe the actual performance of
recommendation methods. This means that our findings reflect actual benefits for
real users. Further, we are able to observe variations over time and conduct studies
on a large scale as we record more and more data. Conversely, evaluation on recorded
data only expresses how a method would have performed. The living lab approach entails some
disadvantages as well. Participants had to deal with technical requirements including
response times, scalability, and availability. Deployed systems faced numerous requests
to which they had to reply within at most 100 ms. This response time restriction represents a
particular challenge for participants located far from Germany, where the ORP servers
are located, since network latency further reduces the available response time. We
offered virtual machines to participants who either had no servers at their disposal or
suffered from high network latency. As a result, these requirements allowed us to verify
how well certain recommendation algorithms adapt to real-world settings.</p>
        <p>We asked participants to deploy their recommendation algorithms on a server.
Subsequently, they connected the server to ORP, which forwarded recommendation requests.
Widgets on the individual news portals’ websites displayed the suggested news items to
users. ORP tracks success in terms of clicks. This opens up several ways to evaluate
participants’ performances. One option is to consider the absolute number of clicks. Considering
the number of clicks relative to the number of requests represents another option. Industry refers to this
metric as click-through rate. Given a comparable number of requests, both quantities
rank participants in the same order. In situations with varying numbers of requests, evaluation becomes tricky.
Considering the total number of clicks may bias the evaluation in favour of participants
who received more requests. Conversely, considering the relative number of clicks per request may
favour teams with few requests. We want to evaluate the performance of a
recommendation algorithm. ORP provides all participants with the chance to obtain a similar number
of requests. We therefore decided to consider the absolute number of clicks as the decisive criterion.
Nevertheless, we additionally present the relative number of clicks per request in our
evaluation.</p>
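        <p>The following Python sketch illustrates how the two criteria can disagree when request volumes differ. The algorithm names and all figures are hypothetical and were invented for this example; they do not correspond to any result of the lab.</p>
        <preformat>
```python
def click_through_rate(clicks, requests):
    """Relative number of clicks per request; 0.0 if no requests were received."""
    return clicks / requests if requests else 0.0


# Hypothetical figures for two algorithms, invented purely for illustration.
stats = {
    "A": {"clicks": 120, "requests": 10000},
    "B": {"clicks": 15, "requests": 500},
}

by_clicks = sorted(stats, key=lambda name: stats[name]["clicks"], reverse=True)
by_ctr = sorted(stats, key=lambda name: click_through_rate(**stats[name]), reverse=True)

print(by_clicks)  # ['A', 'B']  (A collected more clicks in total)
print(by_ctr)     # ['B', 'A']  (B has the higher rate: 0.03 versus 0.012)
```
        </preformat>
        <p>With similar request volumes both orderings would agree, which is the situation ORP tries to establish for all participants.</p>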
        <p>
          Baselines support comparing the relative performance of algorithms. We deployed a
baseline which is detailed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The baseline combines two important factors for news
recommendation: popularity and recency. We consider a fixed number of interactions
that most recently occurred. Our baseline recommends news items included in this list
that users had not previously seen. Consequently, we obtain a computationally efficient
method that inherently considers popularity and recency.
        </p>
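        <p>A minimal sketch of such a baseline is given below, assuming a fixed-size buffer of the most recent interactions. The class name, the default window size of 100, and the per-user bookkeeping are assumptions made for illustration; the default of 6 suggestions follows the maximum number of recommendations the portals display. This is not the implementation deployed in the lab, which is described in the cited reference.</p>
        <preformat>
```python
from collections import deque


class RecentInteractionsBaseline:
    """Sketch of a popularity-and-recency baseline (not the lab's actual code)."""

    def __init__(self, window_size=100):
        # Fixed-size buffer: old interactions drop out automatically (recency),
        # and frequently read items are more likely to be present (popularity).
        self.recent = deque(maxlen=window_size)
        self.seen = {}  # user id -> set of item ids the user has already read

    def observe(self, user, item):
        """Record that `user` interacted with `item`."""
        self.recent.append(item)
        self.seen.setdefault(user, set()).add(item)

    def recommend(self, user, limit=6):
        """Return up to `limit` recently read items the user has not yet seen."""
        already_seen = self.seen.get(user, set())
        suggestions = []
        for item in reversed(self.recent):  # most recent interactions first
            if len(suggestions) >= limit:
                break
            if item not in already_seen and item not in suggestions:
                suggestions.append(item)
        return suggestions
```
        </preformat>
        <p>Both operations touch only the bounded buffer, so the cost per request does not grow with the total amount of recorded data, which is in line with the low computational effort attributed to the baseline.</p>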
        <p>We realised the participants’ need to tune their algorithms. For this reason, we
explicitly defined 3 evaluation periods during which performances would be logged.
Participants could improve their algorithms before as well as in between the periods. We
set the 3 periods to 7-23 February, 1-14 April, and 27-31 May.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>In this section, we detail the results of CLEF NEWSREEL 2014. We start by giving some
statistics about participation in general. Then, we discuss the results for both tasks.
Note that we unfortunately did not receive any submissions for Task 1. We discuss
possible reasons for this.</p>
      <sec id="sec-3-1">
        <title>Participation</title>
        <p>
          In total, 51 participants registered for Task 1, and 52 participants registered for Task 2. Of these,
no participant submitted a solution for Task 1. We observed 13 active participants for
Task 2. Note that participants had the chance to contribute several solutions for Task 2. Four
participants submitted a working notes paper to the CLEF proceedings [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation of Task 1</title>
        <p>We did not receive any submissions for Task 1. Thus, we cannot report any results
on how well future interactions could be predicted. We can think of several reasons
which may have prevented participants from submitting results. First, the data set
has a large volume of more than 60 GB, which participants were required to
process. Participants’ available computational resources may not have allowed them to
iteratively optimise their recommendation algorithms on this amount of data. Second,
we imagine that participants might have preferred Task 2 over Task 1. This preference
may be due to the interactive character as well as the rather unique chance to evaluate
algorithms with actual users’ feedback. We acknowledge that there are already plenty of
data-set-driven competitions. For instance, the online platform www.kaggle.com offers a
variety of data sets. Finally, the restriction to German news articles might have deterred
participants who wanted to evaluate content-based approaches but do not speak German.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation of Task 2</title>
        <p>Throughout the pre-defined evaluation periods, we observed 13 active participants on
ORP. Unfortunately, the component recording the performance failed twice. Thus, we did
not receive data for the periods 7–12 February and 27–31 May. None of the teams
were active in all periods. This illustrates the technical requirements which participants
faced. ORP automatically disables the communication with participants in case their
servers do not respond in time. ORP then tries to re-establish the communication. We noticed
that the re-establishing did not succeed on all occasions. We allowed participants to
deploy several algorithms simultaneously. Some participants used this option more extensively
than others did. Table 1 shows the results for the evaluation periods 7–23 February, 1–14
April, and 27–31 May. We list the number of clicks, the number of requests, and their ratio for each
algorithm which was active during the period. We note that the number of requests
varies between algorithms. Algorithm AL gathered the most clicks in periods 2 and 3
as well as the second most clicks in the first period. Note that the baseline consistently
appears among the five best performing algorithms. This indicates that popularity and
recency represent two important factors when recommending news. Additionally, the
baseline has low computational complexity such that it is able to reply to a large
fraction of requests. Table 2 aggregates the results per participant. The aggregated results
confirm our impressions from the algorithm level.</p>
        <p>
          In addition to the overall figures, we investigate whether particular algorithms perform
exceptionally well in specific contexts. Context refers either to specific news portals
or to times of day. News portals offer varying content. For instance, www.sport1.de is
dedicated to sports-related news while www.gulli.com provides news on information
technology. Thus, we look at the performance of individual algorithms with respect to
specific publishers. Likewise, we investigate algorithms’ performances throughout the
day. We suppose that different types of users consume news at varying hours of the
day. For instance, users reading news early in the morning may have other interests
than users reading late in the evening. This matches the findings of [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Figure 1
shows a heatmap relating algorithms to the publisher and hour of day. We note that
few algorithms perform on comparable levels for all publishers and throughout the day.
This indicates that combining several recommendation algorithms in an ensemble has the
potential to achieve better performance.
        </p>
        <p>
          We do not know the details of all recommendation algorithms. The participants who
described their approaches in working notes pursued different ideas. Most systems
carried a fall-back solution in terms of most-popular and/or most-recent strategies.
Additionally, participants contributed more sophisticated algorithms. These algorithms
included association rules [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], content-based recommenders [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and ensembles of
different recommendation strategies [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Reportedly, more sophisticated methods had
trouble dealing with the high volume of requests. In particular, the peak hours during
lunch break were reportedly hard to handle. Our baseline method combines the notions of
most-popular and most-recent recommendation. The evaluation shows that the baseline is
hard to beat. This may be due to the technical restrictions rather than the recommendation
quality. More sophisticated methods which just miss the response time limit may provide
better recommendations.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>CLEF NEWSREEL enabled participants to evaluate their recommendation
algorithms in two different ways. Task 1 offered
a rich data set recorded over a one-month period on 12 news portals. We removed
time slots for evaluation purposes. Participants were asked to predict which articles users
would read during these held-out times. We received no contribution for this task. Task 2
enabled participants to evaluate their recommendation algorithms by interacting with
actual users. Participants could deploy their algorithms on a server which subsequently
received recommendation requests. This setting closely mirrors circumstances under
which actual recommender systems operate. Participants struggled with the high volume
of requests and the narrow response time limits. We observed that most-popular and
most-recent approaches are hard to beat due to their low complexity.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>The work leading to these results has received funding (or partial funding) from the
Central Innovation Programme for SMEs of the German Federal Ministry for
Economic Affairs and Energy, as well as from the European Union’s Seventh Framework
Programme (FP7/2007-2013) under grant agreement number 610594.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Bennett</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lanning</surname>
          </string-name>
          .
          <article-title>The netflix prize</article-title>
          .
          <source>In KDDCup</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>6</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D.</given-names>
            <surname>Billsus</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Pazzani</surname>
          </string-name>
          .
          <article-title>Adaptive News Access</article-title>
          . In P. Brusilovsky,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kobsa</surname>
          </string-name>
          , and W. Nejdl, editors,
          <source>The Adaptive Web</source>
          , chapter
          <volume>18</volume>
          , pages
          <fpage>550</fpage>
          -
          <lpage>570</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halvey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          .
          <article-title>CLEF 2014 labs and workshops, notebook papers</article-title>
          .
          <source>In CLEF 2014 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Serrano</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cigarran</surname>
          </string-name>
          .
          <article-title>UNED @ CLEF-NEWSREEL 2014</article-title>
          .
          <source>In CLEF 2014 Labs and Workshops, Notebook Papers</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          .
          <article-title>Performance of recommender algorithms on top-n recommendation tasks</article-title>
          .
          <source>In Proceedings of the 2010 ACM Conference on Recommender Systems</source>
          , pages
          <fpage>39</fpage>
          -
          <lpage>46</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Milano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          .
          <article-title>User effort vs. accuracy in rating-based elicitation</article-title>
          .
          <source>In 6th ACM Conference on Recommender Systems</source>
          , pages
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Doychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lawlor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rafter</surname>
          </string-name>
          .
          <article-title>An analysis of recommender algorithms for online news</article-title>
          .
          <source>In CLEF 2014 Labs and Workshops, Notebook Papers</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gunawardana</surname>
          </string-name>
          .
          <article-title>A survey of accuracy evaluation metrics of recommendation tasks</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>10</volume>
          :
          <fpage>2935</fpage>
          -
          <lpage>2962</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Terveen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>Evaluating collaborative filtering recommender systems</article-title>
          .
          <source>ACM Trans. Inf. Syst. (TOIS)</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Plumbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heintz</surname>
          </string-name>
          .
          <article-title>Benchmarking news recommendations in a living lab</article-title>
          .
          <source>In CLEF'14: Proceedings of the Fifth International Conference of the CLEF Initiative</source>
          . Springer Verlag,
          <year>2014</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heintz</surname>
          </string-name>
          .
          <article-title>The plista dataset</article-title>
          .
          <source>In Proceedings of the International News Recommender Systems Workshop and Challenge</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Kuchar</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Kliegr</surname>
          </string-name>
          .
          <article-title>InBeat: Recommender system as a service</article-title>
          .
          <source>In CLEF 2014 Labs and Workshops</source>
          , Notebook Papers,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          .
          <article-title>Real-time news recommendation using context-aware ensembles</article-title>
          .
          <source>In Advances in Information Retrieval</source>
          , pages
          <fpage>51</fpage>
          -
          <lpage>62</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Oliver</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>CLiMF: Learning to maximize reciprocal rank with collaborative less-is-more filtering</article-title>
          .
          <source>In RecSys</source>
          , pages
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>M.</given-names>
            <surname>Tavakolifard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Gulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Almeroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Plumbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brodt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bucko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Heintz</surname>
          </string-name>
          .
          <article-title>Workshop and challenge on news recommender systems</article-title>
          .
          <source>In Proceedings of the 7th ACM conference on Recommender systems</source>
          , pages
          <fpage>481</fpage>
          -
          <lpage>482</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>S.</given-names>
            <surname>Vargas</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          .
          <article-title>Rank and relevance in novelty and diversity metrics for recommender systems</article-title>
          .
          <source>Proceedings of the fifth ACM conference on Recommender systems - RecSys '11</source>
          , page
          <fpage>109</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>S.</given-names>
            <surname>Werner</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          .
          <article-title>Optimizing and evaluating stream-based news recommendation algorithms</article-title>
          .
          <source>In CLEF 2014 Labs and Workshops</source>
          , Notebook Papers,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sivrikaya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          .
          <article-title>When to recommend what? A study on the role of contextual factors in IP-based TV services</article-title>
          .
          <source>In MindTheGap'14: Proceedings of the MindTheGap'14 Workshop</source>
          , pages
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          . CEUR,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>