RARE: A Recurrent Attentive Recommendation Engine for News Aggregators

Dhruv Khattar, Vaibhav Kumar∗, Shashank Gupta, Manish Gupta†, Vasudeva Varma
International Institute of Information Technology Hyderabad
{dhruv.khattar, vaibhav.kumar, shashank.gupta}@research.iiit.ac.in

∗ Author had equal contribution. He can also be contacted at vaibhav2@andrew.cmu.edu.
† Author is also a Principal Applied Researcher at Microsoft.
Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

With news stories coming from a variety of sources, it is crucial for news aggregators to present interesting articles to the user in order to maximize their engagement. This creates the need for a news recommendation system which understands the content of the articles as well as accounts for the users' preferences. Methods such as Collaborative Filtering, which are well known for general recommendations, are not suitable for news because of the short life span of articles and the large number of articles published each day. Apart from this, such methods do not harness the information present in the sequence in which the articles are read by the user and hence are unable to account for the specific and generic interests of the user, which may keep changing with time. In order to address these issues for news recommendation, we propose the Recurrent Attentive Recommendation Engine (RARE). RARE consists of two components and utilizes distributed representations of news articles. The first component models the user's sequential news-reading behaviour in order to understand her general interests, i.e., to get a summary of her interests. The second component utilizes an article-level attention mechanism to understand her specific interests. We feed the information obtained from both components to a Siamese network in order to make predictions which pertain to the user's generic as well as specific interests. We carry out extensive experiments over three real-world datasets and show that RARE outperforms the state-of-the-art. Furthermore, we also demonstrate the effectiveness of our method in handling cold-start cases.

1 Introduction

A news aggregator collects news from a variety of sources and presents it to the user. It would be quite cumbersome for a user to select articles of her choice from a huge list of presented articles which may pertain to a variety of subjects. Hence, it becomes crucial for such aggregators to have a recommendation system to point the user to the most relevant items, and thus maximize her engagement with the site and minimize the time needed to find relevant content.

A popular approach to the task of recommendation is collaborative filtering Bell and Koren (2007); Rennie and Srebro (2005); Salakhutdinov et al. (2007), which uses the user's past interactions with items to predict the most relevant content. Another common approach is content-based recommendation, which extracts features of items and/or users and recommends new items to users based on the similarity between the features. Amongst the various approaches to collaborative filtering, Matrix Factorization (MF) Koren (2008) is the most popular one; it projects users and items into a shared latent space, using a vector of latent features to represent a user or an item. Thereafter, a user's interaction with an item is modelled as the inner product of their latent vectors.
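As a concrete illustration of the MF scoring just described, the following is a minimal NumPy sketch; the factor dimensions and the random factors are purely illustrative, not trained values.

```python
import numpy as np

# Minimal sketch of matrix-factorization scoring: users and items live in a
# shared latent space, and relevance is the inner product of their vectors.
num_users, num_items, k = 1000, 500, 32   # illustrative sizes
rng = np.random.default_rng(0)
P = rng.normal(size=(num_users, k))       # user latent factors
Q = rng.normal(size=(num_items, k))       # item latent factors

def mf_score(u, i):
    """Predicted preference of user u for item i (inner product)."""
    return P[u] @ Q[i]

# Rank all items for user 0 by predicted score and take the top 10.
top10 = np.argsort(P[0] @ Q.T)[::-1][:10]
```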
However, Collaborative Filtering methods are not suitable for news recommendation, because news articles have a short life span and expire quickly Zhong et al. (2015). Such methods also require a considerable number of interactions with an item (article) before making predictions, which is not desirable for news recommendation, because we would ideally want to start recommending articles as soon as they are published. Also, they do not directly harness the information present in the sequence in which the articles were read by the user and hence fail to account for the generic as well as specific interests of the user, which may keep changing with time. In order to address these issues, it becomes crucial to understand the content of the news articles as well as the user's preferences. We explain this through an example in the following paragraph.

As can be seen from Fig. 1(A), if a user reads four different articles belonging to tennis and football, then we would like our model to infer that the generic interests of the user lie in reading articles about sports. This would allow articles belonging to different topics in the sports category to be recommended to the user. However, since the user reads more articles on tennis than on football, we would like to give more weight to the articles related to tennis, as can be seen in Fig. 1(B). Hence, in our overall list of recommended articles, we would like to present news articles related to sports, amongst which articles related to tennis would be given more importance. It may also happen that the user suddenly starts reading articles related to business rather than sports. In such a case we would want to start recommending articles related to business as well. This can be seen in Fig. 1(C). It is important to note that in all these cases the sequential reading history of the user is very important while generating recommendations.

Figure 1: In (A), the user's sequence is used to model her general interests, while in (B), the user's specific interests are captured. In (C), the changing interests of the user are modelled. In all these cases, the sequential reading history of a user plays an important role. Different colors represent the different topics of the articles.

To encode this intuition, we propose a novel neural network framework, namely the Recurrent Attentive Recommendation Engine (RARE). As illustrated in Fig. 3, RARE consists of two components. The first component is based on a recurrent neural network and uses the sequential reading history of the user as its input. We call this the generic encoder.
This helps us to identify the generic/overall interests of the user, i.e., it provides a summary of the user's interests. The second component utilizes a recurrent neural network with an attention mechanism to identify the specific interests of the user. We call this the specific encoder. The attention mechanism allows the model to attend to articles in a differential manner, discriminating the more important ones from the less important ones. We then concatenate the representations obtained from both these components and call the result the unified representation of the user's interests. Limiting the size of the user reading history used as input to both these components allows us to adapt to changing user preferences. We then feed this unified representation, along with the representation of the candidate article, to a Siamese network and compute an element-wise product between the outputs obtained at the final layer of the sister networks, as illustrated in Fig. 2. Finally, we use a logistic unit to compute the score for recommendation. Using such a network enhances the model with further non-linearity and enables it to capture the user-article interaction better. It also allows the model to learn an arbitrary similarity function instead of relying on traditional similarity metrics. The distributed representation of each news article is used as input to our model. This gives us the capability to recommend articles as and when they are produced, without depending on any prior user interaction with those articles.

To summarize, the main contributions of this work are as follows.

• We present a neural network based architecture (RARE) with the following capabilities.
  – It utilizes the content of the news articles, giving it the ability to recommend articles as soon as they are published.
  – It takes into account the users' generic as well as specific interests.
  – It adapts to the changing interests of the user.
• We carry out extensive experiments over three real-world datasets to show the effectiveness of our model. The results reveal that our method outperforms the state-of-the-art.
• We show the effectiveness of our model in solving the cold-start cases as well.

2 Related Work

There has been extensive study on recommendation systems, with a myriad of publications. In this section, we review a representative set of approaches.

2.1 Common Approaches for Recommendation Systems

Recommendation systems in general can be divided into collaborative recommendation systems and content-based recommendation systems. In collaborative filtering based recommendation, an item is recommended to a user if similar users liked that item. Collaborative filtering can be further divided into user collaborative filtering, item collaborative filtering, or a hybrid of both. Examples of such techniques include Bayesian matrix factorization Salakhutdinov and Mnih (2008), matrix completion Rennie and Srebro (2005), Restricted Boltzmann Machines Salakhutdinov et al. (2007), and nearest-neighbour modelling Bell and Koren (2007). In user collaborative methods such as Bell and Koren (2007), the algorithm first computes the similarity between every pair of users based on the items liked by them; the score of a user-item pair is then computed by combining the scores given to that item by similar users. Item-based collaborative filtering Sarwar et al. (2001) computes the similarity between items based on the users who like both items; it then recommends items to the user based on the items she has previously liked. Finally, in user-item based collaborative filtering, both the users and the items are projected into a common vector space based on the user-item matrix, and the item and user representations are then combined to produce a recommendation. Matrix factorization based approaches like Rennie and Srebro (2005) and Salakhutdinov and Mnih (2008) are examples of such a technique. One of the major drawbacks of collaborative filtering is its inability to handle new users and new items, a problem which is often referred to as the cold-start issue.
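As a minimal illustration of the user collaborative filtering scheme just described (a toy example of ours, not code from any of the cited systems), one can compute cosine similarities between users' interaction vectors and score items by similarity-weighted votes:

```python
import numpy as np

# Toy binary interaction matrix: R[u, i] = 1 if user u liked item i.
R = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Cosine similarity between every pair of users.
unit = R / np.linalg.norm(R, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, 0.0)           # ignore self-similarity

# Score of item i for user u: similarity-weighted votes of the other users.
scores = sim @ R
scores[R > 0] = -np.inf              # do not re-recommend items already liked
top_item_per_user = scores.argmax(axis=1)
```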
Another common approach for recommendation is content-based recommendation. In this approach, features from a user's profile and/or an item's description are extracted and used for recommending items to users. The underlying assumption is that users tend to like items similar to those they liked previously. In Liu et al. (2010), each user is modeled by a distribution over news topics that is constructed from the articles she liked, with a prior distribution of topic preferences computed using all users who share the same location. A major advantage of content-based recommendation is that it can handle the problem of item cold-start, as it uses item features for recommendation. For user cold-start, a variety of other features like age, location, and popularity aspects could be used. In the following, we discuss previous work on neural approaches for recommendation systems.

Figure 2: RARE Model Architecture

2.2 Neural Recommendation Systems

Early work which used neural networks Salakhutdinov et al. (2007) used a two-layer Restricted Boltzmann Machine (RBM) to model users' explicit ratings on items. This work was later extended to model the ordinal nature of ratings Phung et al. (2009). Recently, auto-encoders have become a popular choice for building recommendation systems Chen et al. (2012); Sedhain et al. (2015); Strub and Mary (2015). The idea of user-based AutoRec Sedhain et al. (2015) is to learn hidden structures that can reconstruct a user's ratings given her historical ratings as inputs. In terms of user personalization, this approach shares a similar spirit with the item-item model Ning and Karypis (2011); Sarwar et al. (2001), which represents a user in terms of her rated item features. While previous work has lent support to addressing collaborative filtering, most of it has focused on observed ratings and modeled the observed data only. As a result, such models can easily fail to learn users' preferences from positive-only implicit data.

In Wu et al. (2016), a collaborative denoising auto-encoder (CDAE) for CF with implicit feedback is presented. In contrast to the DAE-based CF of Strub and Mary (2015), CDAE additionally plugs a user node into the input of the auto-encoder for reconstructing the user's ratings. As shown by the authors, CDAE is equivalent to the SVD++ model Koren (2008) when the identity function is used to activate the hidden layers of CDAE. Although CDAE is a collaborative filtering model, it is solely based on item-item interaction, whereas the work we present here is based on user-item interaction. On the other hand, in He et al. (2017), the authors explored deep neural networks for recommendation systems. They present a general framework named NCF, short for Neural Collaborative Filtering, that replaces the inner product with a neural architecture that can learn an arbitrary function from the given data. It uses a multi-layer perceptron to learn the user-item interaction function. NCF is able to express and generalize matrix factorization. They then combine the linearity of matrix factorization and the non-linearity of deep neural networks for modelling user-item latent structures. They call this model NeuMF, short for Neural Matrix Factorization.
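To make the NCF idea concrete, here is a minimal sketch (ours, not the authors' code) of replacing the inner product with an MLP over concatenated user and item embeddings, using the Keras functional API; all sizes are illustrative.

```python
from tensorflow.keras import layers, Model

num_users, num_items, dim = 10000, 5000, 32     # illustrative sizes
user_in = layers.Input(shape=(1,), dtype="int32")
item_in = layers.Input(shape=(1,), dtype="int32")

u = layers.Flatten()(layers.Embedding(num_users, dim)(user_in))
v = layers.Flatten()(layers.Embedding(num_items, dim)(item_in))

# The MLP learns the user-item interaction function instead of a dot product.
h = layers.Concatenate()([u, v])
for units in (64, 32):
    h = layers.Dense(units, activation="relu")(h)
score = layers.Dense(1, activation="sigmoid")(h)   # relevance probability

ncf = Model([user_in, item_in], score)
ncf.compile(optimizer="adam", loss="binary_crossentropy")
```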
Since our work also involves projecting articles and users into a common geometric space, we review the work in Huang et al. (2013). The authors propose an effective approach for projecting queries and documents into a common low-dimensional space. The model is named the Deep Structured Semantic Model (DSSM) Huang et al. (2013) and is effective in calculating the relevance of a document given a query by computing the distance between them. Originally this model was meant for the purpose of ranking, but since the problem of ranking has very close associations with that of recommendation, DSSM was later extended to recommendation scenarios in Elkahky et al. (2015). In Elkahky et al. (2015), the authors designed a DSSM such that the first neural network contains the user's query history (and is thus referred to as the user view) and the second neural network contains implicit feedback on items. The resulting model is named multi-view DNN (MV-DNN), since it can incorporate item information from more than one domain and then jointly optimize all of them using the same loss function as DSSM. However, in Elkahky et al. (2015), the features for the users were their search queries and the features for items came from multiple sources (e.g., Apps, Movies/TV, etc.). This makes the approach less adaptable for a news website, as it requires a lot of information outside the news domain. Viewed in its entirety, however, the work suggests that supercharging a neural network with non-linearities to project a user and an item into the same geometric space is very effective in calculating relevance. We draw the inspiration for using a Siamese network in our model on similar grounds.

3 Model Architecture

In this section we first introduce the news article recommendation task and then provide an elaborate description of the various components of the proposed RARE model.

3.1 Task Description

Given a series of news articles read by the user, our task is to recommend articles of interest to the user. The implicit feedback provided by the user is available to us, i.e., we have information about the articles clicked by the user. Apart from this, we also have the content of the news articles at our disposal. We first select a reading history of size R for each user. The size of the reading history determines the number of past interactions we use for making predictions. The articles previously read by a user can be represented as [r_1, r_2, ..., r_t, ..., r_R], where 1 ≤ t ≤ R. Using this list as input to our model, we need to recommend a ranked list of articles which are aligned with the user's interests.

3.2 RARE Overview

We propose a novel Recurrent Attentive Recommendation Engine (RARE) to address the problem of news recommendation for news aggregators. An overview of our method can be seen in Fig. 2. The basic idea of RARE is to build a unified representation of a user's interests which encapsulates both her specific and generic interests. Apart from this, using a fixed amount of reading history per user provides RARE with the flexibility to adapt to the changing interests of the user. The pipeline of RARE can be described as follows.

• We first learn a distributed representation for each news article by combining its title and text.
• We then fix a reading history size R, and use the representations of the previous R articles read by the user as inputs to the model.
• We come up with a unified representation of the user's interests using recurrent neural networks with an attention mechanism.
• Treating the unified representation of the user as a query and the representation of the candidate article as a document, we use a Siamese network to make them undergo similar transformations and supercharge them with non-linearities to discover user-item interactions.

3.3 Distributed Representation for News Articles

We learn a 300-dimensional distributed representation Le and Mikolov (2014) for each news article by combining the title and text of the news article. Learning such a representation allows us to
• capture the overall semantics of the news article, and
• come up with a representation for new news articles as well as for articles of varying lengths.

News articles generally follow an inverted pyramid structure, where the title and the first paragraph give away the desired information. Hence, we only choose the title and the first paragraph, because they usually contain all the relevant information without delving into detailed explanations. We also experimented with using the entire news article, but found better results with just the title and the first paragraph.
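The sketch below illustrates how such a representation could be learned with gensim's Doc2Vec (gensim 4.x API), an implementation of Le and Mikolov (2014); the toy corpus and all hyper-parameters other than the 300-dimensional vector size are ours, not the paper's.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each article is its title plus first paragraph.
articles = {
    "a1": "Federer wins the title. The final lasted five sets ...",
    "a2": "Stocks rally at the open. Tech shares led the gains ...",
}
corpus = [TaggedDocument(words=text.lower().split(), tags=[aid])
          for aid, text in articles.items()]

# 300-dimensional paragraph vectors, as in Section 3.3.
model = Doc2Vec(vector_size=300, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

vec_a1 = model.dv["a1"]  # stored vector of a training article
# A freshly published article gets a vector without any user interaction:
vec_new = model.infer_vector("election results announced today".split())
```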
Figure 3: Two Components of RARE: Generic Encoder and Specific Encoder

3.4 Generic Encoder

The inputs to the generic encoder are the representations of the articles previously read by the user. Fig. 3(a) shows the graphical model of the network used to identify generic interests in RARE. We use a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells. LSTMs have been shown to be capable of learning long-term dependencies Hochreiter and Schmidhuber (1997); Sutskever et al. (2014). The aim of this component is to understand the generic (broader/overall) interests of the user. The last hidden state of the RNN, i.e., h_t, encapsulates this information, which we represent as c^g. We can think of the final hidden state as the overall summary of the user's interests.

The state updates of the LSTM satisfy the following equations:

    f_t = \sigma(W_f [h_{t-1}, r_t] + b_f)    (1)
    i_t = \sigma(W_i [h_{t-1}, r_t] + b_i)    (2)
    o_t = \sigma(W_o [h_{t-1}, r_t] + b_o)    (3)
    l_t = \tanh(V [h_{t-1}, r_t] + d)         (4)
    c_t = f_t \cdot c_{t-1} + i_t \cdot l_t   (5)
    h_t = o_t \cdot \tanh(c_t)                (6)

Here \sigma is the logistic sigmoid function; f_t, i_t, and o_t represent the forget, input, and output gates respectively; r_t denotes the input at time t, and h_t denotes the latent state. W_f, W_i, W_o, and V represent the weight parameters, while b_f, b_i, b_o, and d represent the bias parameters. The forget, input, and output gates control the flow of information throughout the sequence.

3.5 Specific Encoder

The architecture of the specific encoder is similar to that of the generic encoder; its graphical representation can be seen in Fig. 3(b). We use LSTM cells here as well. To capture the specific interests of the user, i.e., to understand the deeper interests of the user within her broader interests, we use an article-level attention mechanism. This provides us with a context vector which encapsulates the specific interests of the user. It can be represented as

    c^s = \sum_{j=1}^{R} \alpha_j h_j    (7)

where the attention weights \alpha_j control the parts of the input sequence which should be emphasized or ignored, and h_j stands for the output of the hidden units. This attention mechanism gives RARE the capability to adaptively focus more on the important items.
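A minimal sketch of the two encoders (Eqs. 1-7) in the Keras functional API follows; the layer sizes and the single-layer attention scorer are illustrative assumptions of ours, not the paper's reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

R, d_article, d_hidden = 12, 300, 128         # illustrative sizes
history = layers.Input(shape=(R, d_article))  # [r_1, ..., r_R]

# Generic encoder: the last LSTM state summarizes overall interests (c^g).
c_g = layers.LSTM(d_hidden)(history)

# Specific encoder: article-level attention over the per-step outputs h_j.
h_seq = layers.LSTM(d_hidden, return_sequences=True)(history)
e = layers.Dense(1)(h_seq)                    # unnormalized score per article
alpha = layers.Softmax(axis=1)(e)             # attention weights alpha_j
c_s = layers.Lambda(                          # c^s = sum_j alpha_j * h_j
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, h_seq])

c_u = layers.Concatenate()([c_g, c_s])        # unified representation (Eq. 8)
```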
3.6 RARE

The complete architecture of the proposed model can be seen in Fig. 4. The outputs obtained from the specific and the generic encoders are concatenated and then used as inputs to a Siamese network along with the candidate article.

Figure 4: Complete Architecture of the RARE System

For the given task, the generic encoder captures the overall interests of the user, i.e., it captures a summary of all the news articles read by the user. At the same time, the specific encoder adaptively selects the important articles to capture the specific interests of the user. Hence, to take advantage of both kinds of information, we concatenate the outputs of both encoders. As shown in Fig. 3, h_t^g is incorporated into c^u to provide the summarized user interests. Note that different encoding mechanisms are invoked in the two encoders when they are trained jointly: the last hidden state of the generic encoder, h_t^g, plays a different role from that of h_t^s. The former has the responsibility of encoding the information present in the sequence in which the articles were read by the user, while the latter is used for computing attention weights. The information obtained from both encoders is utilized to come up with a unified representation of the user's interests:

    c^u = [c^g; c^s] = [h_t^g; \sum_{j=1}^{R} \alpha_j h_j^s]    (8)

where c^u represents the unified representation of the user's interests.

We then use c^u as input to one of the sister networks in the Siamese network, as shown in Fig. 4. The input to the other sister network is the learned representation of the candidate article. The Siamese network supercharges RARE with further non-linearities and makes the user representation and the article representation go through similar transformations. In Huang et al. (2013), an architecture similar to that of a Siamese network has been used for ranking documents with respect to a query with great effectiveness. If we draw a parallel between the query-document problem and our task, one can see that the query in our case is c^u and the document is the representation of the candidate news article. Hence, it seems apt to use such a network if we are to project both of these into the same geometric space to uncover the underlying user-article interaction pattern. A similar technique has also been used by the authors in He et al. (2017) for modelling user-item interactions. Final predictions are obtained from the Siamese network by applying the logistic unit to the element-wise product between the outputs obtained from the sister networks.

Rather than using a Siamese network, the other choice was to use a typical encoder-decoder framework. However, a typical encoder-decoder framework is unable to produce out-of-vocabulary (OOV) words. In the news recommendation setting, each newly published article that has not been interacted with by any user would act as an "OOV word". Since it is crucial for a news recommender to recommend articles as soon as they are published, we resort to the Siamese approach, as it allows us to handle such cases well.
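The following is a minimal sketch of the scoring head under our own assumptions: both representations are first projected to a common size, the 128→64 sister layers share weights, and the logistic unit is applied to their element-wise product. The exact layer configuration is not fully specified in the text, so treat this as illustrative rather than the definitive implementation.

```python
from tensorflow.keras import layers, Model

d_user, d_article = 256, 300                   # illustrative sizes
user_in = layers.Input(shape=(d_user,))        # c^u from the two encoders
art_in = layers.Input(shape=(d_article,))      # candidate article vector

# Project both inputs to a common size, then apply shared sister layers.
proj_u = layers.Dense(128, activation="relu")
proj_a = layers.Dense(128, activation="relu")
shared = layers.Dense(64, activation="relu")   # weights shared by both towers

u = shared(proj_u(user_in))
a = shared(proj_a(art_in))

# Logistic unit on the element-wise product gives y_hat in [0, 1].
score = layers.Dense(1, activation="sigmoid")(layers.Multiply()([u, a]))
siamese_head = Model([user_in, art_in], score)
```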
3.7 Learning

Typically, to learn the model parameters, existing point-wise methods Salakhutdinov and Mnih (2007) perform regression with a squared loss. This is based on the assumption that observations are generated from a Gaussian distribution. However, in He et al. (2017) it has been shown that such a method is not very effective when only implicit data is available.

Given a user u and an article x, let \hat{y}_{ux} represent the predicted score at the output layer. Training is performed by minimizing the point-wise loss between \hat{y}_{ux} and its target value y_{ux}. Considering the one-class nature of implicit feedback, we can view the value of y_{ux} as a label: 1 meaning the item x is relevant to user u, and 0 otherwise. The prediction score \hat{y}_{ux} then represents how likely item x is to be relevant to u. Hence, in order to constrain the values between 0 and 1, we use the logistic function. We then define the likelihood function as follows:

    p(\gamma^+, \gamma^- | I, \Theta_m) = \prod_{(u,i) \in \gamma^+} \hat{y}_{ui} \prod_{(u,j) \in \gamma^-} (1 - \hat{y}_{uj})    (9)

where \gamma^+ and \gamma^- represent the positive (observed interactions) and negative (unobserved interactions) articles respectively, I represents the input, and \Theta_m represents the parameters of the model. The negative log likelihood can then be written as follows (after rearranging the terms):

    L = - \sum_{(u,i) \in \gamma^+ \cup \gamma^-} [ y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log(1 - \hat{y}_{ui}) ]    (10)

The loss is similar to binary cross-entropy and can be minimized using gradient descent methods. It is also worth noticing that the likelihood function is such that it simultaneously adjusts the model's parameters by maximizing the scores of the relevant articles and minimizing the scores of the non-relevant articles. This is similar to what is done while ranking documents corresponding to a query in Huang et al. (2013). Using such a likelihood also gives us the advantages of a ranking function.
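To make Eq. 10 concrete, here is a small NumPy sketch of the point-wise objective over observed positives and sampled negatives; the 1:4 sampling ratio and the scores are our illustration, not the paper's settings.

```python
import numpy as np

def point_wise_loss(y, y_hat, eps=1e-8):
    """Negative log likelihood of Eq. 10 (binary cross-entropy)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# One observed interaction (y = 1) and four sampled unobserved ones (y = 0).
y     = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.4, 0.1, 0.3])   # model scores for these pairs
print(point_wise_loss(y, y_hat))  # decreases as positives are scored higher
```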
4 Experiments

In this section, we describe the datasets, the state-of-the-art methods, and the evaluation protocol, along with the settings used for learning the parameters of the model.

4.1 Dataset

We use three real-world datasets for evaluation. First, we use the dataset published by CLEF NewsREEL 2017 Hopfgartner et al. (2016). CLEF shared a dataset which captures interactions between users and news stories. It includes interactions of eight different publishing sites in the month of February 2016. The recorded stream of events includes 2 million notifications, 58 thousand item updates, and 168 million recommendation requests. It also includes information like the title and text of each news article. For this dataset we considered all the users who had read more than 10 articles, which leaves a total of 22229 users. The other two datasets are provided by a popular news aggregation website (name omitted for review). The second dataset contains a list of articles read by 10297 users in an Indian language, Malayalam. The third dataset contains a list of articles read by 22848 users in Indonesian. We make the code publicly available at https://github.com/dhruvkhattar/RARE.

4.2 Baselines

We compare our proposed approach with the following methods.

• ItemPop. News articles are ranked by their popularity, judged by their number of interactions. This is a non-personalized method to benchmark the recommendation performance Rendle et al. (2009).
• BPR Rendle et al. (2009). This method uses matrix factorization with a pairwise ranking loss, which is tailored to learn to rank from implicit feedback. We report the best performance obtained by varying the learning rate.
• eALS He et al. (2016). This is a state-of-the-art matrix factorization method for item recommendation. It optimizes the squared loss (between actual item ratings and predicted ratings), treats all unobserved interactions as negative instances, and weights them non-uniformly by item popularity.
• NeuMF He et al. (2017). This is a state-of-the-art neural matrix factorization model. It treats the problem of generating recommendations using implicit feedback as a binary classification problem. Consequently, it uses the binary cross-entropy loss to optimize its model parameters.

Our method is based on user-item interactions; hence we mainly compare it with other user-item models. We leave out the comparison with models like SLIM Ning and Karypis (2011) and CDAE Wu et al. (2016) because these are item-item models, and hence performance differences may be caused by the user models for personalization.

4.3 Evaluation Protocol

To evaluate the performance of the recommended items we use the leave-one-out evaluation strategy, which has been widely adopted in the literature Bayer et al. (2017); He et al. (2016); Rendle et al. (2009). For each user, we held out her latest interaction as the test instance and utilized the remaining data for training. Since it is time-consuming to rank all items for every user during evaluation, we followed the popular strategy Elkahky et al. (2015); Koren (2008) that randomly samples 100 items that the user has not interacted with, ranking the test item among these 100 items. The performance of a ranked list is judged by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) He et al. (2015). We truncated the ranked list at 10 for both metrics. As such, HR@k intuitively measures whether the test item is present in the top-k list, and NDCG accounts for the position of the hit by assigning higher scores to hits at top ranks. We calculated both metrics for each test user and report the average score.
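This protocol can be summarized in a few lines; the sketch below (our illustration, with random scores standing in for model outputs) ranks the held-out item among 100 sampled negatives and computes HR@10 and NDCG@10 for one test user.

```python
import numpy as np

def hr_ndcg_at_k(scores, test_pos, k=10):
    """scores: model scores for the test item plus 100 sampled negatives;
    test_pos: index of the held-out test item within that array."""
    rank = int((scores > scores[test_pos]).sum())  # 0-based rank of test item
    hr = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg

scores = np.random.rand(101)        # stand-in for the model's 101 scores
print(hr_ndcg_at_k(scores, test_pos=0))
```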
4.4 Parameter Learning

We use an Intel i7-6700 CPU @ 3.40GHz with 32GB of RAM and a Tesla K40c GPU. We implemented our proposed method using Keras Chollet et al. (2015). We randomly divide the labeled set into a training and a validation set in a 4:1 ratio, and tune the hyper-parameters of our model using the validation set. The proposed model and all its variants are learned by optimizing the log likelihood given by Eq. 10. We initialize the fully connected network weights with the uniform distribution in the range between -\sqrt{6/(fan_{in} + fan_{out})} and \sqrt{6/(fan_{in} + fan_{out})} Glorot and Bengio (2010). We used a batch size of 256 and AdaDelta Zeiler (2012) as the optimizer.

5 Results and Analysis

In this section we present the results obtained by carrying out different experiments with our method.

5.1 Performance Comparison with Baselines

For MF based methods like BPR and eALS, the number of predictive factors chosen is equal to the number of latent factors; we report the best performance in this case. For NeuMF, we vary the size of the CF layers (also latent factors) to choose the best fit for our model.

In Figs. 5 to 7, we compare our method with the baselines. Note that the performance of the ItemPop measure was very weak and hence it does not show up clearly in the graphs. Top-K recommended lists are used, where K varies from 1 to 10. It is very clear from Figs. 5 and 7 that RARE outperforms the other methods by a significant margin across all positions on the NewsREEL and the Malayalam datasets respectively. RARE outperforms the other methods on the Indonesian dataset as well (Fig. 6), but the margin is not as large. Amongst the different baselines, the trend in the performance can be seen as follows: NeuMF > eALS > BPR (in terms of both HR and NDCG). Although in Rendle et al. (2009) it has been shown that BPR can be a strong performer for ranking owing to its pairwise ranking aware learner, we did not see this trend on our datasets. RARE also outperforms all the other baselines in terms of NDCG.

Figure 5: Performance (HR@K and NDCG@K for K = 1 to 10) of RARE vs the state-of-the-art on CLEF NewsREEL

Figure 6: Performance (HR@K and NDCG@K for K = 1 to 10) of RARE vs the state-of-the-art on the Indonesian dataset

Figure 7: Performance (HR@K and NDCG@K for K = 1 to 10) of RARE vs the state-of-the-art on the Malayalam dataset

5.2 Effect of Size of Reading History

We vary the size of the reading history R used as input to our model. From Fig. 8, one can see that the Hit Ratio slowly increases with the size of the reading history up to a certain point, after which it decreases. However, the NDCG keeps on increasing. We can attribute this behaviour to the fact that users have diversified reading interests, which only get effectively captured after a substantial number of interactions have been observed. However, after a while, increasing the user history often leads to over-specialization, where the generic interests tend to overpower the specific ones. This is also an indicator of the fact that the preferences of a user keep varying, and hence a window size should be chosen such that it helps the model dynamically adapt to the user's changing behaviour.

Figure 8: Performance (HR@10 and NDCG@10) of RARE w.r.t. the user's reading history size on NewsREEL

For all our methods, we chose a reading history of 12 for the users. We needed to make a choice between 12 and 14, and we chose 12 because we gave more importance to HR than to NDCG.
5.3 Effect of Different Encoders

We first note the effect on RARE of varying the kind of recurrent unit used. We tested our model using LSTMs, GRUs (Gated Recurrent Units) Chung et al. (2014), and vanilla RNNs. From Fig. 9, the trend in the performance can be observed as follows: LSTM > GRU > RNN, although the differences are not very large. One of the reasons for this could be the fact that an LSTM or a GRU is better able to encode the interests of the user, as they handle long-term dependencies better.

Figure 9: Performance of RARE w.r.t. the recurrent unit used in RARE on NewsREEL

We also note the effects of using different variants of our own model, i.e., of replacing the unified representation in RARE with solely the specific or the generic encoder. The results can be seen in Table 1. We note the trend in performance as follows: RARE > Generic Encoder > Specific Encoder. This indicates that merely identifying the users' generic interests (a summary of overall interests) is not sufficient for learning a good recommendation model. However, when we use a combination of both in RARE, the recommendation performance improves, which clearly indicates that identifying both the specific and the generic interests is essential for better recommendations.

Method                     HR@10   NDCG@10
Specific Encoder           0.916   0.657
Generic Encoder            0.920   0.664
Specific+Generic (RARE)    0.934   0.671

Table 1: Performance using different encoding mechanisms on CLEF NewsREEL

5.4 Performance on Cold Start Cases

We then evaluated our model on the cold-start cases, as can be seen in Fig. 10. For this task we segregated users who had read a new news article at the end of their history, i.e., they read articles which had never been seen by anyone before they read them. We found that the number of such users was 74 in the CLEF dataset; there were very few such users in the other two datasets. Over these 74 users, we see that the HR@10 is around 0.35. This suggests that our model is well suited to handling the item cold-start problem.

For user cold-start, we test our learned model on users who had read between 2 and 4 articles (inclusive) in the same dataset. Since we set the history size to 12, we had to set the remaining inputs to zeros. The HR@10 score was around 0.5, and we see a gradual increase in the hit rate as we increase the value of K. These results indicate the effectiveness of our model in handling the problem of user cold-start as well. Although this is not exactly the user cold-start problem, because it still considers some number of user interactions, the performance is still worth noting, because the baselines need a considerable amount of user history before making predictions. In our method, on the other hand, we can simply use the trained model to recommend articles to users who have had very few interactions.

Figure 10: Performance (HR@K and NDCG@K) of our model on cold-start cases

5.5 Effect of Varying Layers

We observe the performance of our model when we vary the number of layers used in the Siamese network. We experiment by varying the number of layers along with the number of hidden units: one layer of size 128, two layers of sizes 128 and 64, and three layers of sizes 128, 64, and 32. From Table 2, we can see that the best performance is observed in the second case.

Layers         HR@10   NDCG@10
128            0.913   0.659
128→64         0.934   0.671
128→64→32      0.912   0.666

Table 2: Performance of RARE with different numbers of dense layers

6 Conclusion

In this paper, we proposed the Recurrent Attentive Recommendation Engine (RARE) to address the problem of news recommendation. We attempt to encode both the generic and the specific interests of the users. For the former we use a recurrent neural network, while for the latter we use a recurrent network with an attention mechanism. We use the unified representations obtained from both of these, along with a Siamese network, to make predictions. We conducted extensive experiments on three real-world datasets and demonstrated that our method can outperform the state-of-the-art methods in terms of different evaluation metrics.

References
Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A Generic Coordinate Descent Framework for Learning from Implicit Feedback. In WWW.

Robert M. Bell and Yehuda Koren. 2007. Improved Neighborhood-based Collaborative Filtering. In KDD. 7–14.

Minmin Chen, Zhixiang Xu, Fei Sha, and Kilian Q. Weinberger. 2012. Marginalized Denoising Autoencoders for Domain Adaptation. In ICML. 767–774.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555.

Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. In WWW. 278–288.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the Difficulty of Training Deep Feed-forward Neural Networks. In AISTATS, Vol. 9. 249–256.

Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In CIKM. 1661–1670.

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW.

Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. In SIGIR. 549–558.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.

Frank Hopfgartner, Torben Brodt, Jonas Seiler, Benjamin Kille, Andreas Lommatzsch, Martha Larson, Roberto Turrin, and András Serény. 2016. Benchmarking News Recommendations: The CLEF NewsREEL Use Case. In ACM SIGIR Forum, Vol. 49. 129–136.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In CIKM. 2333–2338.

Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In KDD. 426–434.

Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML. 1188–1196.

Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized News Recommendation Based on Click Behavior. In IUI. 31–40.

Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In ICDM. 497–506.

Dinh Q. Phung, Svetha Venkatesh, et al. 2009. Ordinal Boltzmann Machines for Collaborative Filtering. In UAI. 548–556.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. 452–461.

Jasson D. M. Rennie and Nathan Srebro. 2005. Fast Maximum Margin Matrix Factorization for Collaborative Prediction. In ICML. 713–719.

Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In NIPS, Vol. 1.

Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. In ICML. 880–887.

Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann Machines for Collaborative Filtering. In ICML. 791–798.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In WWW. 285–295.

Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative Filtering. In WWW. 111–112.

Florian Strub and Jeremie Mary. 2015. Collaborative Filtering with Stacked Denoising AutoEncoders and Sparse Inputs. In NIPS Workshop on Machine Learning for eCommerce.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS. 3104–3112.

Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In WSDM. 153–162.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.

Erheng Zhong, Nathan Liu, Yue Shi, and Suju Rajan. 2015. Building Discriminative User Profiles for Large-scale Content Recommendation. In KDD. 2277–2286.