Deep Neural Architecture for News Recommendation

Vaibhav Kumar, Dhruv Khattar*, Shashank Gupta, Manish Gupta, and Vasudeva Varma
International Institute of Information Technology Hyderabad, Gachibowli, Telangana - 500032, India
{vaibhav.kumar,dhruv.khattar,shashank.gupta}@research.iiit.ac.in
{manish.gupta,vv}@iiit.ac.in
https://www.iiit.ac.in

Abstract. Deep neural networks have yielded immense success in speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks for recommender systems has received relatively little attention. Also, different recommendation scenarios have their own issues, which creates the need for different approaches to recommendation. Specifically, in news recommendation a major problem is that of varying user interests. In this work, we use deep neural networks with attention to tackle the problem of news recommendation. The key factor in user-item based collaborative filtering is to identify the interaction between user and item features. Matrix factorization is one of the most common approaches for identifying this interaction. It maps both the users and the items into a joint latent factor space such that user-item interactions can be modeled as inner products in that space. Some recent work has used deep neural networks with the aim of learning an arbitrary function, instead of the inner product, for capturing the user-item interaction. However, directly adapting this for the news domain does not seem very suitable. This is because of the dynamic nature of news readership, where the interests of the users keep changing with time. Hence, it becomes challenging for recommendation systems to model both user preferences and the interests which keep changing over time. We present a deep neural model, where non-linear mappings of user and item features are learnt first.
For learning a non-linear mapping for the users we use an attention-based recurrent layer in combination with fully connected layers. For learning the mappings for the items we use only fully connected layers. We then use a ranking-based objective function to learn the parameters of the network. We also use the content of the news articles as features for our model. Extensive experiments on a real-world dataset show a significant improvement of our proposed model over the state-of-the-art by 4.7% (Hit Ratio@10). Along with this, we also show the effectiveness of our model in handling the user cold-start and item cold-start problems.

* Vaibhav Kumar and Dhruv Khattar are the corresponding authors.

Keywords: Deep Neural Networks, News Recommendation

1 INTRODUCTION

The web provides instant access to a wide variety of online news. Hence, it is desirable to have a recommender system that points a user to the most relevant items, thus maximizing the user's engagement with the site and minimizing the time needed to find relevant content. With the advent of deep learning, recommender systems have been applied with good success to products like movies and books, but the problem of news recommendation has received surprisingly little attention.

A major approach to the task of recommendation is collaborative filtering [5][3][4], which uses the user's past interactions with items to predict the most relevant content. Another common approach is content-based recommendation, which uses features of items and/or users to recommend new items to users based on the similarity between features. Amongst the various approaches for collaborative filtering, matrix factorization [11] is the most popular one; it projects users and items into a shared latent space, using a vector of latent features to represent a user or an item. Thereafter, a user's interaction with an item is modeled as the inner product of their latent vectors.
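As a minimal illustration of this latent-factor view, the following numpy sketch scores and ranks items for a user via the inner product of latent vectors. The dimensions and random factors are purely illustrative; this is the generic matrix-factorization scheme described above, not the model proposed in this paper.

```python
import numpy as np

rng = np.random.default_rng(42)

n_users, n_items, f = 4, 6, 3          # f latent factors
P = rng.normal(size=(n_users, f))      # user latent vectors p_u
Q = rng.normal(size=(n_items, f))      # item latent vectors q_i

def predict(u, i):
    """Matrix-factorization estimate: inner product of the latent vectors."""
    return P[u] @ Q[i]

# Recommend for user 0: rank all items by predicted interaction score.
scores = Q @ P[0]
ranking = np.argsort(-scores)          # best-scoring item first
```

In a real system P and Q would be learnt from the observed user-item interactions rather than sampled at random.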
Collaborative filtering needs a considerable amount of interaction history before it can provide high-quality recommendations. This is the well-known cold start problem. For a newly established news website, the problem becomes even more severe, since users have little or no history of interaction with the site. Traditional approaches fail to produce high-quality recommendations in this case. However, in practice, it has been shown that content-based approaches can handle the cold start problem for new items well.

Each recommendation scenario has its own issues, which creates the need for different approaches to building recommendation systems. For example, news recommendation may put more focus on the freshness of the content, while other systems, like movie recommendation, may emphasize content relatedness more. Adding to this, specifically in the case of news, user interests keep evolving over time. A user who reads news articles pertaining only to politics may suddenly develop an interest in sports for various reasons. Hence, it becomes crucial to account for these dynamic changes in interests and still produce good recommendations. Many existing techniques assume user interest to be static, an assumption that seems unrealistic. This suggests the need to handle temporal changes in the interests of the users.

Recently, in [1], the authors proposed a neural network architecture for collaborative filtering. They explore the use of deep neural networks for learning the interaction function from the data. Their proposed method specifically aims to model the relationship between users and items.

In this work, we propose a hybrid approach that uses user-item interactions and the content of the news to capture the similarity between users and items (news).
We only focus on implicit feedback (clicks and impressions) provided by the users, i.e., whether they have read a given article or not and in what sequence those articles were read.

The sequence in which the articles are read by the user encapsulates information about the interests of the user. Capturing the interests of the user from the sequence of read articles requires a component capable of learning long-term dependencies. LSTMs in general have been shown to be suitable for this particular task [29][30]. To capture both the static and the dynamic interests which the user has developed over time, we use bidirectional LSTMs [31]. We choose a specific amount of reading history of each user as input to the LSTMs. Once these interests are captured, we then need to know the extent of each of the user's interests. We incorporate a neural attention mechanism [25] for this purpose. Then, in order to capture the similarity between users and items, we need to be able to project them into the same latent space. We adapt the Deep Structured Semantic Model (DSSM) [8] for this. DSSM was originally used for the task of web document ranking. Later, it was adapted for the task of recommendation in [9]. However, in [9] the features for the users are their search queries and the features for items come from multiple domains (e.g., Apps, Movies/TV etc.), which makes it difficult for a news website to adopt directly, as a lot of information outside the news domain is required. For learning the parameters of the model we use a ranking-based objective function. Finally, for recommending news articles to the users we use the computed inner product between user and item latent vectors.

To summarize, the contributions in this work are as follows.

1. We present a deep neural architecture for news recommendation in which we utilize the user-item interaction as well as the content of the news (items) to model the latent features of users and items.

2.
In order to address the changing interests of the users and the granularity/extent of these interests over time, we incorporate attentional bidirectional LSTMs, which in turn help to model the latent features of the user.

3. We perform experiments to demonstrate the effectiveness of our model for the problem of news recommendation. We then perform experiments to show the effectiveness of our model in solving the problems of user and item cold-start respectively.

The rest of the paper is organized as follows. First, we review major approaches in recommender systems, followed by a discussion of works which are directly related to ours (Section 2). In Section 3, we provide the architecture of our model and also show its relation to matrix factorization. In Section 4, we describe the dataset used and present a comprehensive empirical study to support our claims. Finally, we conclude and suggest future work.

2 RELATED WORK

There has been extensive study of recommendation systems, with a myriad of publications. However, the exploration of deep neural networks for recommender systems has received relatively little scrutiny. In this section, we review a representative set of approaches that are related to our proposed approach.

2.1 Common Approaches for Recommendation

Recommendation systems in general can be divided into collaborative recommendation and content-based recommendation. In a narrower sense, in collaborative filtering based recommendation, an item is recommended to a user if similar users liked that item. Collaborative filtering can be further divided into user collaborative filtering, item collaborative filtering, or a hybrid of both. Examples of such techniques include Bayesian matrix factorization [2], matrix completion [3], Restricted Boltzmann Machines [4], nearest neighbour modelling [5], etc.
In user collaborative methods such as [5], the algorithm first computes the similarity between different users based on the items liked by them. Then, the score of a user-item pair is computed by combining the scores given to that item by similar users. Item-based collaborative filtering [6] computes the similarity between items based on the users who like both items; it then recommends items to the user based on the items she has previously liked. Finally, in user-item based collaborative filtering, both the users and the items are projected into a common vector space based on the user-item matrix, and the item and user representations are combined to produce a recommendation. Matrix factorization based approaches like [3] and [2] are examples of such a technique. One of the major drawbacks of collaborative filtering is its inability to handle new users and new items, a problem often referred to as the cold-start issue.

Another common approach for recommendation is content-based recommendation. In this approach, features are extracted from the user's profile and/or the item's content and used to recommend items to users. The underlying assumption is that users tend to like items similar to those they already like. In [7], each user is modeled by a distribution over news topics that is constructed from articles she liked, with a prior distribution of topic preference computed using all users who share the same location. A major advantage of content-based recommendation is that it can handle the problem of item cold-start, as it uses item features for recommendation. For user cold-start, a variety of other features like age, location and popularity aspects can be used. In the following we discuss recommendation works which use neural networks.
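The user-based scheme described above can be sketched in a few lines of numpy: similarities between users are computed on a binary interaction matrix, and an unseen item's score is the similarity-weighted sum of other users' interactions. This is a generic illustration of the family of methods in [5], with illustrative variable names, not any specific published algorithm.

```python
import numpy as np

def user_based_scores(interactions, user):
    """Score unseen items for `user` by weighting other users'
    interactions with user-user cosine similarity.

    interactions: binary user-item matrix (n_users x n_items).
    """
    norms = np.linalg.norm(interactions, axis=1, keepdims=True)
    unit = interactions / np.maximum(norms, 1e-12)   # row-normalised profiles
    sims = unit @ unit[user]                         # cosine similarity to `user`
    sims[user] = 0.0                                 # exclude the user herself
    scores = sims @ interactions                     # similarity-weighted votes
    scores[interactions[user] > 0] = -np.inf         # mask already-read items
    return scores

# Toy example: user 0 resembles user 1, who also read item 2.
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)
print(np.argmax(user_based_scores(R, 0)))  # → 2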
2.2 Neural Network based Recommendation

Early pioneering work using neural networks was done in [4], where a two-layer Restricted Boltzmann Machine (RBM) was used to model users' explicit ratings of items. The work was later extended to model the ordinal nature of ratings [22]. Recently, autoencoders have become a popular choice for building recommendation systems [24][23][19]. The idea of user-based AutoRec [23] is to learn hidden structures that can reconstruct a user's ratings given her historical ratings as inputs. In terms of user personalization, this approach shares a similar spirit with the item-item model [21][6], which represents a user by her rated item features. While previous work has lent support to collaborative filtering, most of it has focused on observed ratings and modeled the observed data only. As a result, such models can easily fail to learn users' preferences from positive-only implicit data.

The works most relevant to ours are [18] and [1]. In [18], a collaborative denoising autoencoder (CDAE) for CF with implicit feedback is presented. In contrast to the DAE-based CF of [19], CDAE additionally plugs a user node into the input of the autoencoder for reconstructing the user's ratings. As shown by the authors, CDAE is equivalent to the SVD++ model [11] when the identity function is used to activate the hidden layers of CDAE. Although CDAE is a collaborative filtering model, it is solely based on item-item interaction, whereas the work we present here is based on user-item interaction. On the other hand, in [1], the authors explored deep neural networks for recommender systems. They present a general framework named NCF, short for Neural Collaborative Filtering, which replaces the inner product with a neural architecture that can learn an arbitrary function from the given data. It uses a multi-layer perceptron to learn the user-item interaction function. NCF is able to express and generalize matrix factorization.
They then combine the linearity of matrix factorization and the non-linearity of deep neural networks for modelling user-item latent structures. They call this model NeuMF, short for Neural Matrix Factorization.

2.3 User-Item Projection

Since our work is based on user-item based collaborative filtering, we need to project users and items into a common latent space in order to capture their similarity and recommend items to users accordingly. One of the most effective approaches to projecting queries and documents into a common low-dimensional space is the Deep Structured Semantic Model (DSSM) [8], which computes the relevance of a document given a query via the distance between them. Originally this model was meant for the purpose of ranking, but since the problem of ranking is closely associated with that of recommendation, DSSM was later extended to recommendation scenarios in [9]. In [9], the authors used DSSM for recommendation, where the first neural network contains the user's query history (and is thus referred to as the user view) and the second neural network contains implicit feedback on items. The resulting model is named multi-view DNN (MV-DNN), since it can incorporate item information from more than one domain and then jointly optimize all of them using the same loss function as DSSM. However, in [9], the features for the users were their search queries and the features for items came from multiple sources (e.g., Apps, Movies/TV etc.). This makes it less adaptable for a news website, as it requires a lot of information outside the news domain.

For many approaches in recommendation systems, the objective is to minimize the root mean squared error of the user-item matrix reconstruction. However, in [10] it has been shown that a ranking-based objective function is more effective in generating relevant recommendations.
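The common-space projection at the heart of DSSM-style models can be sketched as follows. The sketch assumes the two views have already produced embeddings; relevance is the cosine similarity that DSSM's objective is built on. A minimal numpy illustration, not the full model.

```python
import numpy as np

def cosine_relevance(user_emb, item_embs):
    """DSSM-style relevance: cosine similarity between the user view's
    output embedding and each candidate item embedding (rows)."""
    u = user_emb / np.linalg.norm(user_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    return items @ u

user = np.array([0.6, 0.8])
candidates = np.array([[0.6, 0.8],     # same direction as the user
                       [0.8, -0.6],    # orthogonal
                       [-0.6, -0.8]])  # opposite
rel = cosine_relevance(user, candidates)
# rel ≈ [1.0, 0.0, -1.0]: ranking favours the aligned item
```

Training then pushes positive (clicked) pairs towards high relevance and sampled negatives towards low relevance.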
3 MODEL ARCHITECTURE

We first briefly review DSSM and its extension to recommendation, and then provide the description of our model. We then show the relationship between matrix factorization and our approach.

Fig. 1: Recurrent Attention DSSM Model Architecture

3.1 Deep Structured Semantic Model

The Deep Structured Semantic Model (DSSM) [8] was proposed for the purpose of ranking. Essentially, DSSM can be viewed as a multi-view learning model that typically consists of two or more neural networks, one for each individual view. In the original two-view DSSM model, the network on the left side was meant for query representation, whereas the network on the right side was meant for representing the documents. The input to these networks can be of any arbitrary type, such as the letter-tri-grams of the original paper or the bag of unigrams used in [9]. Each input vector then goes through a non-linear transformation in the feedforward neural network to output an embedding vector, which is smaller in size than the input vector. The learning objective of DSSM is to maximize the cosine similarity between the two output vectors. For the purpose of training, a set of positive examples and randomly sampled negative examples is generated in order to minimize the cosine loss on positive examples.

3.2 Multi-View DNN

In [9], the authors used DSSM for recommendation, where the first neural network contained the query history of users and the second neural network contained the implicit feedback on items (e.g., news clicks, app downloads). The resulting model is named multi-view DNN (MV-DNN), since it can incorporate item information from more than one domain and jointly optimize them using the same loss function as DSSM.

3.3 Recurrent Attention DSSM (RA-DSSM)

In MV-DNN, the input to the user view was merely the query history of users. In this work, we modify the way in which inputs are sent to the user view in order to adapt it specifically for the case of news recommendation. One of the major issues in news recommendation is that of changing user interests. The interests of users can be classified into short-term and long-term interests. Hence, it becomes crucial for a news recommender to identify these interests and recommend accordingly. LSTMs have been shown to be capable of learning long-term dependencies [29][30]. Bidirectional LSTMs, in addition, can capture past and future information effectively. Users' interests keep changing over time, and at the time of recommendation we need to know both the current interests and the long-term interests of the user. Using a bidirectional LSTM as an encoder helps us to identify the interests which the user has taken up recently (short-term) as well as the long-term interests of the user. For each user, we have the sequence in which news articles were read by her. We then choose the first R read articles for each user and use them as inputs to our bidirectional LSTM.
The forward state updates of the LSTM satisfy the following equations:

\[(\overrightarrow{f}_t, \overrightarrow{i}_t, \overrightarrow{o}_t) = \sigma\left(\overrightarrow{W}\left[\overrightarrow{h}_{t-1}, \overrightarrow{r}_t\right] + \overrightarrow{b}\right) \quad (1)\]

\[\overrightarrow{l}_t = \tanh\left(\overrightarrow{V}\left[\overrightarrow{h}_{t-1}, \overrightarrow{r}_t\right] + \overrightarrow{d}\right) \quad (2)\]

\[\overrightarrow{c}_t = \overrightarrow{f}_t \cdot \overrightarrow{c}_{t-1} + \overrightarrow{i}_t \cdot \overrightarrow{l}_t \quad (3)\]

\[\overrightarrow{h}_t = \overrightarrow{o}_t \cdot \tanh(\overrightarrow{c}_t) \quad (4)\]

Here \(\sigma\) is the logistic sigmoid function, and \(\overrightarrow{f}_t\), \(\overrightarrow{i}_t\), \(\overrightarrow{o}_t\) represent the forget, input and output gates respectively. \(\overrightarrow{r}_t\) denotes the input at time t, \(\overrightarrow{h}_t\) denotes the latent state, and \(\overrightarrow{b}\) and \(\overrightarrow{d}\) represent the bias terms. The forget, input and output gates control the flow of information throughout the sequence. \(\overrightarrow{W}\) and \(\overrightarrow{V}\) are the weight matrices associated with the connections. The backward states \((\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_R)\) are computed in a similar manner. The amount of reading history used as input to the bidirectional LSTM is denoted by R. We then concatenate the forward and backward states to obtain the annotations \((h_1, h_2, \ldots, h_R)\), where

\[h_i = \left[\overrightarrow{h}_i; \overleftarrow{h}_i\right] \quad (5)\]

We then need to identify the extent/granularity of each interest. Recently, in [25], the effectiveness of attention mechanisms has been shown for the task of neural machine translation. The goal of the attention mechanism in such tasks is to derive a context vector that captures relevant source-side information to help predict the current target word. In our case, we want to use the sequence of annotations generated by the encoder to come up with a context vector that captures the extent of the user's interests. While in a typical RNN encoder-decoder framework [25] a context vector is generated at each time step to predict the target word, in our case we only need to calculate the context vector for a single time step:

\[c_{attention} = \sum_{j=1}^{R} \alpha_j h_j \quad (6)\]

where \(h_1, \ldots, h_R\) represent the sequence of annotations to which the encoder maps the sequence of read news articles, and each \(\alpha_j\) represents the weight corresponding to annotation \(h_j\).
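Equations (1)-(6) can be sketched end to end in numpy as below. The sketch makes one assumption the text leaves open: the attention weights \(\alpha_j\) are produced by a softmax over dot-product alignment scores with a learned vector `w_align`, which is one common scoring choice, not necessarily the one used in the paper. All dimensions and parameter names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, r_t, W, b, V, d):
    """One forward LSTM update, Eqs. (1)-(4). The gates share the
    concatenated input [h_{t-1}, r_t]; W produces the stacked forget,
    input and output gates, V produces the candidate l_t."""
    z = np.concatenate([h_prev, r_t])
    gates = sigmoid(W @ z + b)             # (1): f_t, i_t, o_t stacked
    f_t, i_t, o_t = np.split(gates, 3)
    l_t = np.tanh(V @ z + d)               # (2): candidate cell state
    c_t = f_t * c_prev + i_t * l_t         # (3): cell-state update
    h_t = o_t * np.tanh(c_t)               # (4): hidden state
    return h_t, c_t

def encode_and_attend(seq, params_fwd, params_bwd, w_align, dim):
    """Run the sequence forwards and backwards, concatenate the states
    into annotations h_i = [h_fwd_i; h_bwd_i] (Eq. 5), then return the
    context vector sum_j alpha_j h_j with softmax weights (Eq. 6)."""
    def run(inputs, params):
        h, c, states = np.zeros(dim), np.zeros(dim), []
        for r_t in inputs:
            h, c = lstm_step(h, c, r_t, *params)
            states.append(h)
        return states
    fwd = run(seq, params_fwd)
    bwd = run(seq[::-1], params_bwd)[::-1]
    annotations = np.array([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
    scores = annotations @ w_align         # illustrative alignment scoring
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return alphas @ annotations            # c_attention

rng = np.random.default_rng(0)
dim, rdim, R_hist = 4, 3, 5                # hidden size, input size, history R
def make_params():
    return (rng.normal(size=(3 * dim, dim + rdim)), np.zeros(3 * dim),
            rng.normal(size=(dim, dim + rdim)), np.zeros(dim))
context = encode_and_attend(rng.normal(size=(R_hist, rdim)),
                            make_params(), make_params(),
                            rng.normal(size=2 * dim), dim)
```

The resulting `context` vector plays the role of \(c_{attention}\) and is what the fully connected layers of the user view consume.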
The user view (left view) of the model can be seen in Figure 1. The input to it is a selected amount of reading history of each user. Each \(r_i\) in the figure is a news embedding of dimension 300. The right view of the DSSM remains the same, as can be seen in Figure 1. For inputs to the right view of the DSSM, we select one positive sample, i.e., an article that has been read by the user (apart from those that were used as input to the user view), and n randomly selected negative samples (articles that have not been read by the user). Each \(item^+\) and \(item^-\) used as input to the item view is also an embedding of size 300.

3.4 Learning

Typically in matrix factorization, to learn the model parameters, existing pointwise methods [13][16] perform regression with a squared loss. This is based on the assumption that observations are generated from a Gaussian distribution. However, in [1] it has been shown that such a method is not well suited when only implicit data is available. Also, in [10] it has been shown that a ranking-based objective function is more suitable for the task of recommendation. Keeping these two aspects in mind, we adapt the loss function used in DSSM [8]. We first compute the posterior probability of a clicked news item given a user from the relevance score using a softmax function:

\[P(item^+ \mid u) = \frac{\exp(R(u, item^+))}{\sum_{\forall item} \exp(R(u, item))} \quad (7)\]

where u denotes the user, \(item^+\) denotes the item that was clicked by the user, and R represents the inner product function. We then maximize the likelihood of the clicked news items given the users with the following loss function:

\[L(\Lambda) = -\log \prod_{(u, item^+)} P(item^+ \mid u) \quad (8)\]

where \(\Lambda\) represents the parameters of our model.

3.5 Relation with Matrix Factorization

We now show how our model can be interpreted as a special case of matrix factorization, which is one of the most popular models for recommendation and has been investigated extensively in the literature.
Matrix factorization models map both users and items to a joint latent factor space of dimensionality f, such that user-item interactions are modeled as inner products in that space. Accordingly, each item i is associated with a vector \(q_i \in \mathbb{R}^f\) and each user u with a vector \(p_u \in \mathbb{R}^f\). For a given item i, the elements of \(q_i\) measure the extent to which the item possesses the corresponding factors, positive or negative. For a given user u, the elements of \(p_u\) measure the extent of interest the user has in items that are high on the corresponding factors, again, positive or negative. The resulting dot product of the two vectors captures the interaction between user u and item i. This approximates user u's rating of item i, denoted by \(r_{ui}\), leading to the estimate

\[\hat{r}_{ui} = q_i^T p_u \quad (9)\]

The major challenge in this is to compute \(q_i, p_u \in \mathbb{R}^f\). We solve this problem by using deep neural networks. The deep neural architecture allows us to learn a non-linear mapping of the users and the items to the same latent space. For computing the mapping for the users, we first use a recurrent network followed by an attention layer. Fully connected layers are then used for bringing the users and the items into the same latent space. In the final layer of the DSSM, we compute the similarity between the user and the item using the dot product of the non-linear mappings of the input vectors. The user can then be represented as \(\Phi(u)\) and the item as \(\Phi(i)\) (here \(\Phi\) represents the learnt non-linear mapping). Finally, we estimate the rating as

\[\hat{r}_{ui} = \Phi(i)^T \Phi(u) \quad (10)\]

While in [1] the authors chose to learn an arbitrary interaction function to compute this similarity, we learn a non-linear transformation and then utilise the dot product for computing the similarity.

4 EXPERIMENTS

We conduct experiments to answer the following questions:

1. Does our proposed model outperform the state-of-the-art implicit collaborative methods?
Also, how do the different variations of our model perform on the given task?

2. How does our proposed model work for solving the item cold start problem?

3. How does our proposed model work for solving the user cold start problem?

4.1 DATASET

For this work we use the dataset published by CLEF NewsREEL 2017. CLEF NewsREEL provides an interaction platform to compare the performance of different news recommender systems in an online as well as an offline setting [32]. As a part of their evaluation for the offline setting, CLEF shared a dataset which captures interactions between users and news stories. It includes interactions from eight different publishing sites in the month of February 2016. The recorded stream of events includes 2 million notifications, 58 thousand item updates, and 168 million recommendation requests. The dataset also provides other information, like the title and text of each news article, time of publication, etc. Each user can be identified by a unique id. For our task, we extract the sequence in which the articles were read by the users. Along with this, we also extract the content of each of these read articles. Since we rely only on implicit feedback, we only need to know whether an article was read by a user or not.

4.2 Experimental Settings

As mentioned earlier, we use the dataset provided by CLEF NewsREEL 2017. We extract the sequence in which the articles were read by the users. For each article we concatenate the title and the text and use gensim [12] to learn doc2vec [27] embeddings for them. The size of the embeddings is set to 300. In the given dataset, almost 77% of the users have read fewer than 3 articles. We choose users who have read between 10 and 15 (inclusive) articles for training and testing our model for item recommendation. The frequency of users who have read more than 15 articles varies extensively, and hence we restrict ourselves to the upper bound of 15.
We then choose users who have read between 2 and 4 articles for testing our model on the user cold start problem. For the item cold start problem, we again test on users who have read between 10 and 15 articles. We ensure that the chronology of the data is kept intact.

Evaluation Protocol: To evaluate the performance of the recommended items we use the leave-one-out evaluation strategy, which has been widely adopted in the literature [26][13][14]. For each user we held out her latest interaction as the test set and utilized the remaining data for training. Since it is time-consuming to rank all items for every user during evaluation, we followed the common strategy [9][11] of randomly sampling 100 items with which the user has not interacted, ranking the test item among these 100 items. The performance of a ranked list is judged by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) [15]. Without special mention, we truncated the ranked list at 10 for both metrics. As such, HR@k intuitively measures whether the test item is present in the top-k list, and NDCG@k accounts for the position of the hit by assigning higher scores to hits at top ranks. We calculated both metrics for each test user and report the average score.

Baselines: We compare our proposed approach with the following methods:

– ItemPop. News articles are ranked by their popularity, judged by their number of interactions. This is a non-personalized method to benchmark the recommendation performance [14].

– BPR [14]. This method optimizes the matrix factorization model with a pairwise ranking loss, which is tailored to learning from implicit feedback. We report the best performance obtained by varying the learning rate.

– eALS [13]. This is a state-of-the-art matrix factorization method for item recommendation.
It optimizes the squared loss (between actual item ratings and predicted ratings), treating all unobserved interactions as negative instances and weighting them non-uniformly by item popularity.

– NeuMF [1]. This is a state-of-the-art neural matrix factorization model. It treats the problem of generating recommendations from implicit feedback as a binary classification problem. Consequently, it uses the binary cross-entropy loss to optimize its model parameters.

For all the above methods we choose the number of predictive factors which maximizes the performance on our dataset. Our proposed method is based on modelling the user-item relationship, hence we mainly compare it with other user-item models. We leave out the comparison with models like SLIM [21] and CDAE [18] because these are item-item models, and hence performance differences may be caused by the user models for personalization.

Parameter Settings: For all the conducted experiments we use an Intel i7-6700 CPU @ 3.40GHz with 32 GB of RAM and a Tesla K40c GPU. We ran all our experiments on the GPU. We implemented our proposed method using Keras [20]. As mentioned earlier, for each user who had read between 10 and 15 (inclusive) articles we held out the last read article for our test set. We then construct our training set as follows:

1. We first define the amount of reading history that we want to use. We denote the reading history by Rh.

2. For each user, we use Rh read articles as inputs to the user view. Leaving the last read article out, the remaining articles are used as positive samples for the item view (right view) of the model.

3. For each positive instance of a user, we randomly sample n negative instances (news items that the user has not interacted with), which are used as inputs for the item view of the model. We experimentally set the number of negative instances n to 4.

We then randomly divide the training set into a training and a validation set in a 4:1 ratio.
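The instance construction in steps 1-3 above can be sketched as follows. The function name and data layout are illustrative; items are represented by integer ids, and Rh and n follow the settings described in the text.

```python
import random

def build_instances(read_sequence, all_items, rh, n_neg=4, seed=0):
    """Build training instances for one user: the first `rh` reads form
    the user-view input; each later read (except the held-out last one)
    is a positive item, paired with `n_neg` sampled articles the user
    never interacted with."""
    rng = random.Random(seed)
    read = set(read_sequence)
    history = read_sequence[:rh]
    positives = read_sequence[rh:-1]        # last read is held out for testing
    unread = [i for i in all_items if i not in read]
    instances = []
    for pos in positives:
        negs = rng.sample(unread, n_neg)    # negative sampling
        instances.append((history, pos, negs))
    return instances

seq = list(range(10))                       # user read articles 0..9 in order
inst = build_instances(seq, list(range(100)), rh=6, n_neg=4)
# 3 instances: positives are articles 6, 7, 8; article 9 is held out
```

Each `(history, pos, negs)` tuple then becomes one user-view input paired with one \(item^+\) and n \(item^-\) inputs for the item view.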
This helps us to ensure that the two sets do not overlap. We tuned the hyper-parameters of our model using the validation set. All the model and its variants are learnt by optimizing the log loss of Equation 8. We initialise the fully connected p network weights with p the uniform distribution in the range between − 6/(f anin + f anout) and 6/(f anin + f anout) [28] . We used a batch size of 256 and used adadelta [17] as a gradient based optimizer for learning the parameters of the model. Also, it is worth noticing that, just in the case of NeuMF [1], where the size of the last layer of the deep network determines the number of predictive factors, we can also treat the size of the last layer of our network (just before computing the similarity) as the number of used predictive factors. 0.89 0.88 0.87 HR@10 0.86 0.85 0.84 4 5 6 7 8 9 10 11 12 13 14 Reading History Fig. 2: HR@10 of RA-DSSM w.r.t. the User’s Reading History 0.68 0.65 0.62 NDCG@10 0.59 0.56 0.53 0.5 4 5 6 7 8 9 10 11 12 13 14 Reading History Fig. 3: NDCG@10 of RA-DSSM w.r.t. the User’s Reading History 1 0.8 0.6 HR@K 0.4 RA-DSSM NeuMF 0.2 eALS BPR ItemPop 0 0 1 2 3 4 5 6 7 8 9 10 11 K Fig. 4: HR@K performance of our model vs some state-of-the-art models 0.7 0.6 NDCG@K 0.5 0.4 0.3 RA-DSSM 0.2 NeuMF eALS 0.1 BPR ItemPop 0 0 1 2 3 4 5 6 7 8 9 10 11 K Fig. 5: NDCG@K performance of our model vs some state-of-the-art models 0.6 0.5 0.4 HR@K 0.3 0.2 Cold News Cold User 0.1 0 1 2 3 4 5 6 7 8 9 10 11 K Fig. 6: HR@K of RA-DSSM on Cold-Start cases 0.4 NDCG@K 0.3 0.2 0.1 Cold News Cold User 0 0 1 2 3 4 5 6 7 8 9 10 11 K Fig. 7: NDCG@K of RA-DSSM on Cold-Start cases 0.95 0.85 0.75 HR@K 0.65 0.55 LSTM 0.45 GRU RNN 0.35 0 1 2 3 4 5 6 7 8 9 10 11 K Fig. 8: HR@K for different recurrent units 0.7 0.65 0.6 NDCG@K 0.55 0.5 0.45 LSTM 0.4 GRU RNN 0.35 0 1 2 3 4 5 6 7 8 9 10 11 K Fig. 
9: NDCG@K for different recurrent units

4.3 Performance Comparison

Figure 2 and Figure 3 show the performance of our model when varying the amount of reading history used as input to the user side of RA-DSSM. Overall, we see that as we increase the amount of reading history used, the performance also increases. This shows that a user has multiple interests, which gradually get captured as the number of articles used for the user view of RA-DSSM is increased. Since the interests of a user develop and vary with time, we also experimented with concatenating the time at which each article was read to the article embeddings and using these as inputs to the model. We observed no significant change in performance. One of the prime reasons for this could be that the model, given its sequential nature, is already able to encode the aspect of time. Figure 4 and Figure 5 show the performance of the Top-K recommended lists, where the ranking position K ranges from 1 to 10. We leave out the variants of our own model here and compare only the best performing one, i.e., RA-DSSM. As can be clearly seen from the figures, our model shows consistent improvements over the other methods across all positions. This can be attributed to the fact that, apart from accounting for the user’s general preferences, we also account for the user’s changing interests and the extent of those interests, which the baselines do not incorporate directly. We observe major improvements in the NDCG scores of our model, with an approximately 22% improvement over NeuMF. The reason for this is the loss function of Equation 8 used by our model: being optimized for ranking, it helps the model recommend a better-ranked list of items. Among the baseline methods, we see that eALS outperforms BPR by a margin of 2%.
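The HR@K and NDCG@K scores reported above can be computed per test user as follows. This is a minimal sketch under the standard leave-one-out protocol, where each user has exactly one held-out positive item; the function names are illustrative.

```python
import math

def hit_ratio_at_k(ranked_items, held_out, k):
    """HR@K: 1 if the held-out test item appears in the top-K list, else 0."""
    return 1.0 if held_out in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, held_out, k):
    """NDCG@K with a single relevant item: 1 / log2(rank + 2) when the item
    sits at 0-based position `rank` within the top-K, else 0. With one
    relevant item the ideal DCG is 1, so no further normalization is needed."""
    if held_out in ranked_items[:k]:
        return 1.0 / math.log2(ranked_items.index(held_out) + 2)
    return 0.0
```

Averaging these per-user scores over all test users yields the HR@10 and NDCG@10 figures; NDCG additionally rewards placing the held-out item higher in the list, which is why a ranking-optimized loss helps it most.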
We also note that ItemPop performs the worst, which indicates the need for modelling users’ personalized preferences. We then evaluated our model for the cold-start cases, as shown in Figure 6 and Figure 7. For this task we segregated users whose last read article was new, i.e., an article that had never been read before. There were 74 such users. For these users, at HR@10, we were able to recommend that article around 35% of the time. This suggests that our model is well suited for handling the item cold-start problem. For user cold-start, we test our learned model on users who had read between 2 and 4 (inclusive) articles. The HR@10 score was around 50%, and we see a gradual increase in the hit rates as we increase the value of K. These results suggest that our model handles the user cold-start problem as well. We then note the effect of varying the kind of recurrent network used. We tested our model with LSTMs, GRUs (Gated Recurrent Units) [33], and vanilla RNNs. From Figure 8 and Figure 9, the trend in the performance is: LSTM > GRU > RNN. One of the reasons for this could be that an LSTM or a GRU is better able to encode the interests of the user. In Table 1, we report the performance of adding bidirectional units and an attention layer to the LSTM. We note that Attention BiLSTM > BiLSTM > LSTM. The attention layer does indeed enable us to capture the extent of interests, as it performs slightly better than the bidirectional LSTM.

5 CONCLUSION AND FUTURE WORK

In this work we used deep neural networks for news recommendation. We combined user-item collaborative filtering with the content of the read news articles to come up with our model. We tackled the problem of changing and diverse reading interests of users using a recurrent network combined with neural attention.
We also showed the effectiveness of our model in solving the user cold-start and item cold-start problems, as well as its effectiveness when using one-hot item encodings, which demonstrates its adaptability to other recommendation scenarios that rely purely on implicit feedback. In the future, we would like to study the effect of learning an arbitrary function, instead of using the inner product, to calculate the similarity between the user and the item. We would also like to evaluate our model over different recommendation scenarios. Apart from this, we would like to explore the idea of reinforcement learning for news recommendation, where the implicit feedback provided by the users could be used to model their interests and recommend articles to them.

References

1. Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17).
2. Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th international conference on Machine learning. ACM, 880–887.
3. Jasson DM Rennie and Nathan Srebro. 2005. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning. ACM, 713–719.
4. Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th international conference on Machine learning. ACM, 791–798.
5. Robert M Bell and Yehuda Koren. 2007. Improved neighborhood-based collaborative filtering. In KDD cup and workshop at the 13th ACM SIGKDD international conference on knowledge discovery and data mining. Citeseer, 7–14.
6. Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms.
In Proceedings of the 10th international conference on World Wide Web. ACM, 285–295.
7. Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th international conference on Intelligent user interfaces. ACM, 31–40.
8. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 2333–2338.
9. Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web. ACM, 278–288.
10. Joonseok Lee, Samy Bengio, Seungyeon Kim, Guy Lebanon, and Yoram Singer. 2014. Local collaborative ranking. In Proceedings of the 23rd international conference on World wide web. ACM, 85–96.
11. Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 426–434.
12. Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
13. Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 549–558.
14. Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback.
In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.
15. Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. Trirank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 1661–1670.
16. Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In NIPS, Vol. 1. 2–1.
17. Matthew D Zeiler. 2012. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012).
18. Yao Wu, Christopher DuBois, Alice X Zheng, and Martin Ester. 2016. Collaborative denoising auto-encoders for top-n recommender systems. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 153–162.
19. Florian Strub and Jeremie Mary. 2015. Collaborative Filtering with Stacked Denoising AutoEncoders and Sparse Inputs. In NIPS Workshop on Machine Learning for eCommerce.
20. François Chollet and others. 2015. Keras. https://github.com/fchollet/keras. (2015).
21. Xia Ning and George Karypis. 2011. Slim: Sparse linear methods for top-n recommender systems. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 497–506.
22. Dinh Q Phung, Svetha Venkatesh, and others. 2009. Ordinal Boltzmann machines for collaborative filtering. In Proceedings of the Twenty-fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 548–556.
23. Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. ACM, 111–112.
24. Minmin Chen, Zhixiang Xu, Fei Sha, and Kilian Q Weinberger. 2012. Marginalized Denoising Autoencoders for Domain Adaptation. In Proceedings of the 29th International Conference on Machine Learning (ICML-12). 767–774.
25. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014.
Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
26. Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A Generic Coordinate Descent Framework for Learning from Implicit Feedback. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17).
27. Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
28. Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Aistats, Vol. 9. 249–256.
29. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
30. Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
31. Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
32. Frank Hopfgartner, Torben Brodt, Jonas Seiler, Benjamin Kille, Andreas Lommatzsch, Martha Larson, Roberto Turrin, and András Serény. 2016. Benchmarking news recommendations: The clef newsreel use case. In ACM SIGIR Forum, Vol. 49. ACM, 129–136.
33. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).