=Paper= {{Paper |id=Vol-2554/paper3 |storemode=property |title=On the Importance of News Content Representation in Hybrid Neural Session-based Recommender Systems |pdfUrl=https://ceur-ws.org/Vol-2554/paper_03.pdf |volume=Vol-2554 |authors=Gabriel De Souza P. Moreira,Dietmar Jannach,Adilson Marques Da Cunha |dblpUrl=https://dblp.org/rec/conf/recsys/MoreiraJC19 }}
On the Importance of News Content Representation in Hybrid Neural Session-based Recommender Systems

Gabriel de Souza P. Moreira* (CI&T, Campinas, SP, Brazil), gspmoreira@gmail.com
Dietmar Jannach (University of Klagenfurt, Klagenfurt, Austria), dietmar.jannach@aau.at
Adilson Marques da Cunha (Instituto Tecnológico de Aeronáutica, São José dos Campos, SP, Brazil), cunha@ita.br
ABSTRACT
News recommender systems are designed to surface relevant information for online readers by personalizing their user experiences. A particular problem in that context is that online readers are often anonymous, which means that this personalization can only be based on the last few recorded interactions with the user, a setting named session-based recommendation. Another particularity of the news domain is that fresh articles are constantly published, which should be immediately considered for recommendation. To deal with this item cold-start problem, it is important to consider the actual content of items when recommending. Hybrid approaches are therefore often considered the method of choice in such settings. In this work, we analyze the importance of considering content information in a hybrid neural news recommender system. We contrast content-aware and content-agnostic techniques and also explore the effects of using different content encodings. Experiments on two public datasets confirm the importance of adopting a hybrid approach. Furthermore, we show that the choice of the content encoding can have an impact on the resulting performance.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Neural networks;

KEYWORDS
Recommender Systems; Hybrid Systems; News Recommendation; Session-Based Recommendation; Recurrent Neural Networks

1 INTRODUCTION & BACKGROUND
Many of today's major media and news aggregator websites, including The New York Times [38], The Washington Post [9], Google News [5], and Yahoo! News [39], provide automated reading recommendations for their users. News recommendation, while being one of the earliest application fields of recommenders, is often still considered a challenging problem for many reasons [16].
   Among them, there are two types of cold-start problems. First, there is the permanent item cold-start problem. In the news domain, we have to deal with a constant stream of possibly thousands of new articles published each day [38]. At the same time, these articles become outdated very quickly [5]. Second, on many news sites, we have to deal with user cold-start, when users are anonymous or not logged in [7, 22, 25], which means that personalization has to be based on the few observed interactions (e.g., clicks) of the user.
   In many application domains of recommenders, collaborative filtering techniques, which only rely on observed preference patterns in a user community, have proven to be highly effective in the past. However, in the particular domain of news recommendation, hybrid techniques, which also consider the actual content of a news item, have often been shown to be preferable for dealing with item cold-start, see e.g., [2, 8, 22, 23, 25, 26, 37, 39].
   Likewise, to deal with user cold-start issues, session-based recommendation techniques have received more research interest in recent years. In these approaches, the provided recommendations are not based on long-term preference profiles, but solely on adapting recommendations according to the most recent observed interactions of the current user.
   Technically, a number of algorithmic approaches can be applied to this problem, from rule-learning techniques, over nearest-neighbor schemes, to more complex sequence learning methods and deep learning approaches; for an overview, see [34]. Among the neural methods, Recurrent Neural Networks (RNN) are a natural choice for learning sequential models [12, 21]. Attention mechanisms have also been used for session-based recommendation [27].
   The goal of this work is to investigate two aspects of hybrid session-based news recommendation using neural networks. Our first goal is to understand the value of considering content information in a hybrid system. Second, we aim to investigate to what extent the choice of the mechanism for encoding the articles' textual content matters. To that purpose, we conducted experiments with various encoding mechanisms, both unsupervised (like Latent Semantic Analysis and doc2vec) and supervised ones, using a realistic streaming-based evaluation protocol. The outcomes of our studies, which were based on two public datasets, confirm the usefulness of considering content information. However, the quality and detail of the content representation matter, which means that care should be taken of these aspects in practical settings. Second, we found that the specific document encoding can make a difference in recommendation quality, although those differences are sometimes small. Finally, we found that content-agnostic nearest-neighbor methods, which are considered highly competitive with RNN-based techniques in other scenarios [14, 28], fell behind the neural approach used here on several performance measures.

* Also with Brazilian Aeronautics Institute of Technology.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
INRA'19, September, 2019, Copenhagen, Denmark
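To make the notion of a document encoding concrete before the methodology is introduced, the following sketch builds a TF-IDF-weighted average of word embeddings and L2-normalizes the result, in the spirit of one of the unsupervised techniques compared in this paper. This is our own illustration, not the authors' code; the toy vocabulary, vectors, and IDF values are invented for the example.

```python
import numpy as np

def tfidf_weighted_embedding(tokens, word_vecs, idf):
    """Encode a document as the TF-IDF-weighted average of its
    word embeddings, followed by L2 normalization."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    vecs, weights = [], []
    for t, tf in counts.items():
        if t in word_vecs and t in idf:
            vecs.append(word_vecs[t])
            weights.append(tf * idf[t])  # TF-IDF weight of the term
    if not vecs:
        return None
    emb = np.average(np.array(vecs), axis=0, weights=weights)
    return emb / np.linalg.norm(emb)  # L2-normalize the embedding

# Toy word vectors and IDF values (invented for illustration only).
word_vecs = {"election": np.array([1.0, 0.0]),
             "vote":     np.array([0.8, 0.2]),
             "the":      np.array([0.1, 0.1])}
idf = {"election": 2.0, "vote": 1.5, "the": 0.01}

ace = tfidf_weighted_embedding(["the", "election", "vote", "vote"], word_vecs, idf)
```

The L2 normalization keeps the feature scale of different documents comparable while preserving cosine similarity between related articles, which is the motivation the paper itself gives for normalizing its Article Content Embeddings.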

2 METHODOLOGY
To conduct our experiments, we have implemented different instantiations of our deep learning meta-architecture for news recommendation called CHAMELEON [32, 33]. The main component of the architecture is the Next-Article Recommendation (NAR) module, which processes various types of input features, including pre-trained Article Content Embeddings (ACE) and contextual information about users (e.g., time, location, device) and items (e.g., recent popularity, recency). These inputs are provided for all clicks of a user observed in the current session to generate next-item recommendations based on an RNN (e.g., GRU, LSTM).
   The ACEs are produced by the Article Content Representation (ACR) module. The input to the module is the article's text, represented as a sequence of word embeddings (e.g., using Word2Vec [31]) pre-trained on a large corpus. These embeddings are further processed by feature extractors, which can be instantiated as Convolutional Neural Networks (CNN) or RNNs. The ACR module's neural network is trained in a supervised manner for a side task: to predict metadata attributes of an article, such as categories or tags. Figure 1 illustrates how the Article Content Embeddings are used within CHAMELEON's processing chain to provide next-article recommendations.

Figure 1: A simplified overview of CHAMELEON. The components for which we tested different variants are shaded.

   In this work, we first analyzed the importance of considering article content information for recommendations. Second, we experimented with different techniques for textual content representation^1 and investigated how they might affect recommendation quality. The different variants that were tested^2 are listed in Table 1.
   For the experiments, CHAMELEON's NAR module took the following features as input, described in more detail in [33]^3: (1) Article Content Embeddings (generated by the different techniques presented in Table 1), (2) article metadata (category and author^4), (3) article context (novelty and recency), and (4) user context (city, region, country, device type, operating system, hour of the day, day of the week, referrer).

Table 1: Alternative content processing techniques.

Technique    | Input     | Description
No-ACE       | None      | In this setting, no content representation is used as input.
Supervised
CNN          | word2vec^5 | A 1D-CNN-based model trained to classify the articles' metadata (e.g., category). The architecture combines three CNNs with window sizes of 3, 4, and 5 to model n-grams. The output of an intermediate layer is used as the textual representation. For more details, see [32, 33].
GRU          | word2vec  | Similar to the CNN-based version, a GRU layer is trained to classify metadata. The outputs of the GRU layer are max-pooled to generate representations.
Unsupervised
LSA          | Raw text  | Traditional Latent Semantic Analysis (LSA) [6]. We used a variation based on TF-IDF vectors [36] and Truncated SVD [11].
W2V*TF-IDF   | word2vec  | TF-IDF-weighted word embeddings [24], a technique to represent a piece of text as the average of its word embeddings weighted by TF-IDF [36].
doc2vec      | Raw text  | Paragraph Vector (a.k.a. doc2vec) [19] learns fixed-length feature representations from variable-length pieces of text, trained via the distributed memory and distributed bag-of-words models.

^1 As there were some very long articles, the text was truncated after the first 12 sentences and concatenated with the title. Article Content Embeddings (ACE) produced by the selected techniques were L2-normalized to make the feature scales similar, but also to preserve high similarity scores for embeddings from similar articles.
^2 We also experimented with GRU-based Sequence Autoencoders (adapted from SA-LSTM [4]) to extract textual features by reconstructing the sequence of input word embeddings, but this technique did not lead to better results than the other unsupervised methods.
^3 Note that the experiments reported here did not include the trainable Article ID feature used in the experiments from [33], which can lead to slightly improved accuracy, but possibly reduces the differences observed between the content representations.
^4 Article author and user city are available only for the Adressa dataset.
^5 Portuguese: a pre-trained Word2Vec [31] skip-gram model (300 dimensions) is available at http://nilc.icmc.usp.br/embeddings; Norwegian: a skip-gram model (100 dimensions) is available at http://vectors.nlpl.eu/repository (model #100).

3 EXPERIMENTAL SETUP
We adopt a temporal offline evaluation method as proposed in [32, 33], which simulates a streaming flow of new user interactions (clicks) and articles being published. Since in practical environments it is highly important to quickly react
to incoming events [15, 17, 30], the baseline recommender methods are constantly updated over time. CHAMELEON's NAR module also supports online learning. The training process of CHAMELEON emulates a streaming scenario with mini-batches, in which each user session is used for training only once. Such a scalable approach is different from other techniques, like GRU4Rec [12], which require training for several epochs on a larger set of past interactions to reach high accuracy.

3.1 Evaluation Protocol
The evaluation process works as follows:
(1) The recommenders are continuously trained on user sessions ordered by time and grouped by hours. Every five hours, the recommenders are evaluated on sessions from the next hour. With this interval of five hours (not a divisor of 24 hours), we cover different hours of the day for evaluation. After the evaluation of the next hour is done, this hour is also considered for training, until the entire dataset is covered.^6 Note that CHAMELEON's model is only updated after all events of the test hour are processed. This allows us to emulate a realistic production scenario where the model is trained and deployed once an hour to serve recommendations for the next hour;
(2) For each session in the test set, we incrementally reveal one click after the other to the recommender, as done, e.g., in [12, 35];
(3) For each click to be predicted, we sample a random set containing 50 recommendable articles (the ones that received at least one click by any user in the preceding hour) that were not viewed by the user in their session (negative samples) plus the true next article (positive sample), as done in [3] and [18]. We then evaluate the algorithms on the task of ranking those 51 items; and
(4) Given these rankings, standard information retrieval (top-n) metrics can be computed.

3.2 Metrics
As relevant quality factors for the news domain [16], we considered accuracy, item coverage, and novelty. To determine the metrics, we took measurements at list length 10. As accuracy metrics, we used the Hit Rate (HR@n), which checks whether or not the true next item appears in the top-n ranked items, and the Mean Reciprocal Rank (MRR@n), a ranking metric that is sensitive to the position of the true next item. Both metrics are common when evaluating session-based recommendation algorithms [12, 15, 28].
   Since it is sometimes important that a news recommender not only focuses on a small set of items, we also considered Item Coverage (COV@n) as a quality criterion. We computed item coverage as the number of distinct articles that appeared in any top-n list divided by the number of recommendable articles [13]. In our case, the recommendable articles are the ones viewed at least once in the last hour by any user. To measure novelty, we used the ESI-R@n metric [33], adapted from [1, 41, 42]. The metric is based on item popularity and returns higher values when long-tail items are among the top-n recommendations.

3.3 Datasets
We use two public datasets from news portals:
(1) Globo.com (G1) dataset. Globo.com is the most popular media company in Brazil. The dataset^7 was collected at the G1 news portal, which has more than 80 million unique users and publishes over 100,000 new articles per month; and
(2) SmartMedia Adressa. This dataset contains approximately 20 million page visits from a Norwegian news portal [10]. In our experiments, we used its complete version^8, which includes article text and click events of about 2 million users and 13,000 articles.
   Both datasets include the textual content of the news articles, article metadata (such as publishing date, category, and author), and logged user interactions (page views) with contextual information. Since we are focusing on session-based news recommendation and short-term user preferences, it is not necessary to train the algorithms over long periods. Therefore, and because articles become outdated very quickly, we selected all available user sessions from the first 16 days of both datasets for our experiments.
   In a pre-processing step, as in [8, 28, 40], we organized the data into sessions, using a 30-minute threshold of inactivity as an indicator of a new session. Sessions were then sorted by the timestamp of their first click. From each session, we removed repeated clicks on the same article, as we are not focusing on the capability of algorithms to act as reminders as in [20]. Sessions with only one interaction are not suitable for next-click prediction and were discarded. Sessions with more than 20 interactions (stemming from outlier users with unusual behavior or from bots) were truncated.
   The characteristics of the resulting pre-processed datasets are shown in Table 2. Coincidentally, the datasets are similar in many statistics, except for the total number of published articles, which is much higher for G1 than for the Adressa dataset.

Table 2: Statistics of the datasets used for the experiments.

                        Globo.com (G1)    Adressa
Language                Portuguese        Norwegian
Period (days)           16                16
# users                 322,897           314,661
# sessions              1,048,594         982,210
# clicks                2,988,181         2,648,999
# articles              46,033            13,820
Avg. session length     2.84              2.70

^6 Our dataset consists of 16 days. We used the first 2 days to learn an initial model for the session-based algorithms and report the averaged measures after this warm-up.
^7 https://www.kaggle.com/gspmoreira/news-portal-user-interactions-by-globocom
^8 http://reclab.idi.ntnu.no/dataset
3.4 Baselines
The baselines used in our experiments are summarized in Table 3. While some baselines appear conceptually simple, recent work has shown that they are often able to outperform very recent neural approaches for session-based recommendation tasks [14, 28, 29]. Unlike neural methods such as GRU4REC, these methods can be continuously updated over time to take newly published articles into account. A comparison of GRU4REC with some of our baselines in a streaming scenario is provided in [15], and specifically in the news domain in [32], which is why we do not include GRU4REC and similar methods here.

Table 3: Baseline recommendation algorithms.

Association Rules-based and Neighborhood Methods
Co-Occurrence (CO): Recommends articles commonly viewed together with the last read article in previous user sessions [15, 28].
Sequential Rules (SR): This method also uses association rules of size two. However, it considers the sequence of the items within a session and uses a weighting function when two items do not immediately appear after each other [28].
Item-kNN: Returns the items most similar to the last read article, using the cosine similarity between their vectors of co-occurrence with other items within sessions. This method has been commonly used as a baseline for neural approaches, e.g., in [12].^9

Non-personalized Methods
Recently Popular (RP): This method recommends the most viewed articles within a defined set of recently observed user interactions on the news portal (e.g., clicks during the last hour). Such a strategy proved to be very effective in the 2017 CLEF NewsREEL Challenge [30].
Content-Based (CB): For each article read by the user, this method suggests recommendable articles with content similar to the last clicked article, based on the cosine similarity of their Article Content Embeddings (generated by the CNN technique described in Table 1).

^9 We also made experiments with session-based methods proposed in [28] (e.g., V-SkNN), but they did not lead to results that were better than those of the SR and CO methods.

   Replicability. We publish the data and source code used in our experiments online^10, including the code for CHAMELEON, which is implemented using TensorFlow.

^10 https://github.com/gabrielspmoreira/chameleon_recsys

4 EXPERIMENTAL RESULTS
The results for the G1 and Adressa datasets after (hyper-)parameter optimization for all methods are presented^11 in Tables 4 and 5.

^11 The highest values for a given metric are highlighted in bold. The best values for the CHAMELEON configurations are printed in italics. If the best results are significantly different (p < 0.001) from all other algorithms, they are marked with *. We used paired Student's t-tests with Bonferroni correction for significance tests.

Table 4: Results for the G1 dataset.

Recommender    HR@10     MRR@10    COV@10    ESI-R@10
CHAMELEON with ACEs generated differently
No-ACE         0.6281    0.3066    0.6429    6.3169
CNN            0.6585    0.3395    0.6493    6.2874
GRU            0.6585    0.3388    0.6484    6.2674
W2V*TF-IDF     0.6575    0.3291    0.6500    6.4187
LSA            0.6686*   0.3423    0.6452    6.3833
doc2vec        0.6368    0.3119    0.6431    6.4345
Baselines
SR             0.5911    0.2889    0.2757    5.9743
Item-kNN       0.5707    0.2801    0.3892    6.5898
CO             0.5699    0.2625    0.2496    5.5716
RP             0.4580    0.1994    0.0220    4.4904
CB             0.3703    0.1746    0.6855*   8.1683*

Table 5: Results for the Adressa dataset.

Recommender    HR@10     MRR@10    COV@10    ESI-R@10
CHAMELEON with ACEs generated differently
No-ACE         0.6816    0.3252    0.8185    5.2453
CNN            0.6860    0.3333    0.8103    5.2924
GRU            0.6856    0.3327    0.8096    5.2861
W2V*TF-IDF     0.6913    0.3402    0.7976    5.3273
LSA            0.6935    0.3403    0.8013    5.3347
doc2vec        0.6898    0.3402    0.7968    5.3417
Baselines
SR             0.6285    0.3020    0.4597    5.4445
Item-kNN       0.6136    0.2769    0.5287    5.4668
CO             0.6178    0.2819    0.4198    5.0785
RP             0.5647    0.2481    0.0542    4.1464
CB             0.3273    0.1197    0.8807*   7.6534*

   Accuracy Results. In general, we can observe that considering content information is in fact highly beneficial in terms of recommendation accuracy. It is also possible to see that the choice of the article representation matters. Surprisingly, the long-established LSA method was the best-performing technique for representing the content on both datasets in terms of accuracy, even when compared to more recent techniques using pre-trained word embeddings, such as the CNN and the GRU.
   For the G1 dataset, the Hit Rate (HR) was improved by around 7% and the MRR by almost 12% when using the LSA representation instead of the No-ACE setting. For the Adressa dataset, the differences between the No-ACE setting and the hybrid methods leveraging text are less pronounced. The improvement using LSA compared to the No-ACE setting was around 2% for HR and 5% for MRR.
   Furthermore, for the Adressa dataset, it is possible to observe that all the unsupervised methods for generating ACEs (LSA, W2V*TF-IDF, and doc2vec) performed better than the supervised ones, unlike for the G1 dataset. A possible explanation is that the supervised methods depend more on the quality and depth of the available article metadata. While the G1 dataset uses a fine-grained categorization scheme (461 categories), the categorization of the Adressa dataset is much coarser (41 categories).
   Among the baselines, SR leads to the best accuracy results, but does not match the performance of the content-agnostic No-ACE setting of the RNN. This indicates that the hybrid approach of considering additional contextual information, as done by CHAMELEON's NAR module in this condition, is important.
   Recommending only based on content information (CB), as expected, does not lead to competitive accuracy results, because the popularity of the items is not taken into account
(which SR and neighborhood-based methods implicitly do). Recommending only recently popular articles (RP) works better than CB, but does not match the performance of the other methods.

Coverage and Novelty. In terms of coverage (COV@10), the simple content-based (CB) method leads to the highest value, as it recommends across the entire spectrum based solely on content similarity, without considering the popularity of the items. It is followed by the various CHAMELEON instantiations, where it turned out that the specifically chosen content representation is not too important in this respect.

As expected, the CB method also frequently recommends long-tail items, which leads to the highest value in terms of novelty (ESI-R@10). The popularity-based method (RP), in contrast, leads to the lowest novelty value. Among the other methods, the traditional Item-kNN method, to some surprise, leads to the best novelty results, even though neighborhood-based methods have a certain popularity bias. Looking at the other configurations, using unsupervised methods to represent the text of the articles can help to drive the recommendations a bit away from the popular ones.
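The two beyond-accuracy metrics discussed above can be sketched as follows, in the spirit of the definitions by Vargas [41]. This is a simplified illustration: the toy recommendation lists and click counts are hypothetical, and the exact discount and normalization used in the paper's evaluation may differ.

```python
# Illustrative computation of catalog coverage (COV@n) and expected
# self-information with rank discount (ESI-R@n) for top-n lists.
import math

def coverage_at_n(rec_lists, catalog_size):
    """COV@n: share of the article catalog that appears in at least one
    top-n recommendation list."""
    recommended = set().union(*rec_lists)
    return len(recommended) / catalog_size

def esi_r_at_n(rec_list, popularity, total_clicks):
    """ESI-R@n: rare (long-tail) items contribute -log2 p(i) bits of
    self-information, and lower ranks are discounted logarithmically, so
    popularity-biased lists score low and novel lists score high."""
    score = norm = 0.0
    for rank, item in enumerate(rec_list):
        discount = 1.0 / math.log2(rank + 2)      # rank discount
        p_item = popularity[item] / total_clicks  # empirical click probability
        score += discount * -math.log2(p_item)    # self-information
        norm += discount
    return score / norm

lists = [["a", "b", "c"], ["b", "d", "e"]]
print(coverage_at_n(lists, catalog_size=10))      # 5 distinct items -> 0.5
clicks = {"a": 50, "b": 30, "c": 10, "d": 5, "e": 5}
print(esi_r_at_n(["a", "b", "c"], clicks, total_clicks=100))
```

A list of frequently clicked articles yields a low ESI-R value, while a list drawn from the long tail yields a high one, which matches the CB-versus-RP contrast reported above.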
5 SUMMARY AND CONCLUSION
The consideration of content information for news recommendation has proved important in the past, and many hybrid systems have therefore been proposed in the literature. In this work, we investigated the relative importance of incorporating content information in both streaming- and session-based recommendation scenarios. Our experiments highlighted the value of content information by showing that it helped to outperform otherwise competitive baselines. Furthermore, the experiments demonstrated that the choice of the article representation can matter. However, the value of considering additional content information depends on the quality and depth of the available data, especially for supervised methods. From a practical perspective, this indicates that quality assurance and curation of the content information can be essential for obtaining better results.

REFERENCES
[1] Pablo Castells, Neil J. Hurley, and Saul Vargas. 2015. Novelty and Diversity in Recommender Systems. In Recommender Systems Handbook, Francesco Ricci, Lior Rokach, and Bracha Shapira (Eds.). Springer US, 881–918.
[2] Wei Chu and Seung-Taek Park. 2009. Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web (WWW'09). 691–700.
[3] Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys'10). 39–46.
[4] Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems. 3079–3087.
[5] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web (WWW'07). 271–280.
[6] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
[7] Jorge Díez Peláez, David Martínez Rego, Amparo Alonso Betanzos, Óscar Luaces Rodríguez, and Antonio Bahamonde Rionda. 2016. Metrical Representation of Readers and Articles in a Digital Newspaper. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys'16).
[8] Elena Viorica Epure, Benjamin Kille, Jon Espen Ingvaldsen, Rebecca Deneckere, Camille Salinesi, and Sahin Albayrak. 2017. Recommending Personalized News in Short User Sessions. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys'17). 121–129.
[9] Ryan Graff. 2015. How the Washington Post used data and natural language processing to get people to read more news. https://knightlab.northwestern.edu/2015/06/03/how-the-washington-posts-clavis-tool-helps-to-make-news-personal/. (June 2015).
[10] Jon Atle Gulla, Lemei Zhang, Peng Liu, Özlem Özgöbek, and Xiaomeng Su. 2017. The Adressa dataset for news recommendation. In Proceedings of the International Conference on Web Intelligence (WI'17). 1042–1048.
[11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 2 (2011), 217–288.
[12] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In Proceedings of the Fourth International Conference on Learning Representations (ICLR'16).
[13] Dietmar Jannach, Lukas Lerche, Iman Kamehkhosh, and Michael Jugovac. 2015. What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction 25, 5 (2015), 427–491.
[14] Dietmar Jannach and Malte Ludewig. 2017. When recurrent neural networks meet the neighborhood for session-based recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (RecSys'17). 306–310.
[15] Michael Jugovac, Dietmar Jannach, and Mozhgan Karimi. 2018. StreamingRec: A Framework for Benchmarking Stream-based News Recommenders. In Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys'18). 306–310.
[16] Mozhgan Karimi, Dietmar Jannach, and Michael Jugovac. 2018. News recommender systems – Survey and roads ahead. Information Processing & Management 54, 6 (2018), 1203–1227.
[17] Benjamin Kille, Andreas Lommatzsch, Frank Hopfgartner, Martha Larson, and Torben Brodt. 2017. CLEF 2017 NewsREEL Overview: Offline and Online Evaluation of Stream-based News Recommender Systems. In Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum.
[18] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. IEEE Computer 42, 8 (2009).
[19] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML'14). 1188–1196.
[20] Lukas Lerche, Dietmar Jannach, and Malte Ludewig. 2016. On the Value of Reminders within E-Commerce Recommendations. In Proceedings of the 2016 Conference on User Modeling, Adaptation and Personalization (UMAP'16).
[21] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural Attentive Session-based Recommendation. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM'17). 1419–1428.
[22] Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. 2011. SCENE: a scalable two-stage personalized news recommendation system. In Proceedings of the 34th International Conference on Research and Development in Information Retrieval (SIGIR'11). 125–134.
[23] Lei Li, Li Zheng, Fan Yang, and Tao Li. 2014. Modeling and broadening temporal user interest in personalized news recommendation. Expert Systems with Applications 41, 7 (2014), 3168–3177.
[24] Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. 2015. Support vector machines and word2vec for text classification with semantic features. In 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC'15). 136–140.
[25] Chen Lin, Runquan Xie, Xinjun Guan, Lei Li, and Tao Li. 2014. Personalized news recommendation via implicit social experts. Information Sciences 254 (2014), 1–18.
[26] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces (IUI'10). 31–40.
[27] Qiao Liu, Yifu Zeng, Refuoe Mokhosi, and Haibin Zhang. 2018. STAMP: Short-Term Attention/Memory Priority Model for Session-based Recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD'18). 1831–1839.
[28] Malte Ludewig and Dietmar Jannach. 2018. Evaluation of Session-based Recommendation Algorithms. User Modeling and User-Adapted Interaction 28, 4–5 (2018), 331–390.
[29] Malte Ludewig, Noemi Mauro, Sara Latifi, and Dietmar Jannach. 2019. Performance Comparison of Neural and Non-Neural Approaches to Session-based Recommendation. In Proceedings of the 2019 ACM Conference on Recommender Systems (RecSys'19).
[30] Cornelius A. Ludmann. 2017. Recommending News Articles in the CLEF News Recommendation Evaluation Lab with the Data Stream Management System Odysseus. In Working Notes of the Conference and Labs of the Evaluation Forum (CLEF'17).
[31] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems (NIPS'13). 3111–3119.
[32] Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. 2018. News Session-Based Recommendations using Deep Neural Networks. In Proceedings of the 3rd Workshop on Deep Learning for Recommender Systems (DLRS) at ACM RecSys'18. 15–23.
[33] Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. arXiv preprint arXiv:1904.10367 (2019).
[34] Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM Computing Surveys (CSUR) 51, 4 (2018), 66.
[35] Massimo Quadrana, Alexandros Karatzoglou, Balázs Hidasi, and Paolo Cremonesi. 2017. Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks. In Proceedings of the 11th ACM Conference on Recommender Systems (RecSys'17). 130–137.
[36] Juan Ramos. 2003. Using TF-IDF to determine word relevance in document queries. Technical Report, Department of Computer Science, Rutgers University.
[37] Junyang Rao, Aixia Jia, Yansong Feng, and Dongyan Zhao. 2013. Personalized news recommendation using ontologies harvested from the web. In International Conference on Web-Age Information Management. 781–787.
[38] A. Spangher. 2015. Building the Next New York Times Recommendation Engine. https://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/. (Aug 2015).
[39] Michele Trevisiol, Luca Maria Aiello, Rossano Schifanella, and Alejandro Jaimes. 2014. Cold-start news recommendation with domain-dependent browse graph. In Proceedings of the 8th ACM Conference on Recommender Systems (RecSys'14). 81–88.
[40] Bartlomiej Twardowski. 2016. Modelling Contextual Information in Session-Aware Recommender Systems with Neural Networks. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys'16). 273–276.
[41] Saúl Vargas. 2015. Novelty and Diversity Evaluation and Enhancement in Recommender Systems. PhD thesis. Universidad Autónoma de Madrid.
[42] Saúl Vargas and Pablo Castells. 2011. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys'11). 109–116.