<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deep Neural Architecture for News Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vaibhav Kumar</string-name>
          <email>vaibhav.kumar@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhruv Khattar</string-name>
          <email>dhruv.khattar@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shashank Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Institute of Information Technology Hyderabad</institution>
          ,
          <addr-line>Gachibowli, Telangana - 500032</addr-line>
          ,
          <country>India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep neural networks have yielded immense success in speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks for recommender systems has received relatively little attention. Moreover, different recommendation scenarios have their own issues, which creates the need for different recommendation approaches. In news recommendation specifically, a major problem is that of varying user interests. In this work, we use deep neural networks with attention to tackle the problem of news recommendation. The key factor in user-item based collaborative filtering is identifying the interaction between user and item features. Matrix factorization is one of the most common approaches for identifying this interaction: it maps both users and items into a joint latent factor space such that user-item interactions can be modeled as inner products in that space. Some recent work has used deep neural networks to learn an arbitrary interaction function instead of the inner product. However, directly adapting such models to the news domain is not very suitable because of the dynamic nature of news readership, where the interests of users keep changing with time. It is therefore challenging for recommendation systems to model user preferences while also accounting for interests that change over time. We present a deep neural model in which non-linear mappings of user and item features are learnt first. For the users, we learn this mapping with an attention-based recurrent layer in combination with fully connected layers; for the items, we use only fully connected layers. We then use a ranking-based objective function to learn the parameters of the network, and we use the content of the news articles as features for our model.
Extensive experiments on a real-world dataset show a significant improvement of our proposed model over the state-of-the-art by 4.7% (Hit Ratio@10). We also show the effectiveness of our model in handling the user cold-start and item cold-start problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>The web provides instant access to a wide variety of online news. It is therefore
desirable to have a recommender system that points a user to
the most relevant items, maximizing the user's engagement with the
site and minimizing the time spent finding relevant content. With the advent of deep
learning, recommender systems have been used with good success for
products like movies and books, but the problem of news recommendation
has surprisingly received very little attention.</p>
      <p>
        A major approach to the task of recommendation is called collaborative
filtering [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] which uses the user’s past interaction with the item to predict
the most relevant content. Another common approach is content-based
recommendation, which uses features of items and/or users to recommend new
items to users based on the similarity between those features. Amongst
the various approaches for collaborative filtering, matrix factorization [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is the
most popular one, which projects users and items into a shared latent space,
using a vector of latent features to represent a user or an item. Thereafter, a
user’s interaction with an item is modeled as the inner product of their latent
vectors.
      </p>
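      <p>
        As an illustration of this formulation, the following sketch scores every user-item pair as a dot product of latent vectors. The vectors here are made up for illustration, not learned from data.
      </p>

```python
import numpy as np

# Toy latent-factor model: 3 users, 4 items, f = 2 latent factors.
# These vectors are illustrative, not learned from real interactions.
P = np.array([[0.9, 0.1],    # user 0: strong interest in factor 0
              [0.2, 0.8],    # user 1: strong interest in factor 1
              [0.5, 0.5]])   # user 2: mixed interests
Q = np.array([[1.0, 0.0],    # item 0: high on factor 0
              [0.0, 1.0],    # item 1: high on factor 1
              [0.7, 0.3],
              [0.4, 0.6]])

# Predicted affinity of every user for every item: r_hat[u, i] = p_u . q_i
r_hat = P @ Q.T

# User 0 should prefer item 0, since both concentrate on factor 0.
print(r_hat[0].argmax())  # -> 0
```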
      <p>Collaborative filtering needs a considerable amount of interaction history
before it can provide high-quality recommendations. This is
known as the cold-start problem. For a newly established news website,
the problem becomes even more severe, since users have little or no history
of interaction with the site, and traditional approaches fail to produce
high-quality recommendations in this case. In practice, however,
content-based approaches have been shown to handle the cold-start problem for new items well.</p>
      <p>Each recommendation scenario has its own issues, which creates the need for
different approaches to building recommendation systems. For example, news
recommendation may put more focus on the freshness of the content, while other
systems such as movie recommenders may emphasize content
relatedness. Adding to this, specifically in the case of news, user interests keep
evolving over time: a user who reads news articles pertaining only to politics
may suddenly develop an interest in sports for various reasons. Hence, it becomes
crucial to account for these dynamic changes in interest while producing better
recommendations. Many existing techniques assume user interest to be static,
an assumption which seems unrealistic. This suggests the need to handle temporal changes in
the interests of the users.</p>
      <p>
        Recently in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors proposed a neural network architecture for
collaborative filtering, exploring the use of deep neural networks for learning the
interaction function from data. Their method specifically aims to
model the relationship between users and items.
      </p>
      <p>In this work, we propose a hybrid approach that uses user-item
interactions and the content of the news to capture the similarity between users
and items (news articles). We focus only on implicit feedback (clicks and impressions)
provided by the users, i.e., whether they have read a given article and in
what sequence those articles were read.</p>
      <p>
        The sequence in which the articles are read by the user encapsulates
information about the interests of the user. Capturing the interests of the user from
the sequence of read articles requires a component which should be capable of
learning long-term dependencies. LSTMs in general have been shown to be suitable
for this particular task [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ][
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. To capture both static and dynamic interests
which the user has developed over time, we use bidirectional LSTMs [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. We
choose a fixed amount of reading history for each user as input to the LSTMs.
Once these interests are captured, we then need to know the extent of each of the
user’s interests. We incorporate a neural attention mechanism [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] for this
purpose. Then, in order to capture the similarity between users and items, we need
to be able to project them to the same latent space. We adapt Deep Structured
Semantic Model (DSSM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for this. DSSM was originally used for the task of
web document ranking. Later, it was adapted for the task of recommendation in
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the features for the users are their search queries and features
for items come from multiple domains (e.g., Apps, Movies/TV), which makes
it difficult for a news website to adopt directly, as a lot of information outside
the news domain is required. For learning the parameters of the model we
use a ranking-based objective function. Finally, for recommending news articles
to the users we use the computed inner product between user and item latent
vectors.
      </p>
      <p>To summarize, the contributions of this work are as follows.
1. We present a deep neural architecture for news recommendation in which we
utilize the user-item interactions as well as the content of the news (items)
to model the latent features of users and items.
2. In order to address the changing interests of the users and the
extent of these interests over time, we incorporate attentional
bidirectional LSTMs, which in turn help to model the latent features of the users.
3. We perform experiments to demonstrate the effectiveness of our model for
the problem of news recommendation, and further experiments to show
its effectiveness in solving the user and item cold-start problems.</p>
      <p>The rest of the paper is organized as follows. First, we review major approaches
in recommender systems, followed by a discussion of works which are directly
related to ours. In Section 3, we give a brief description of the dataset used. After
that, in Section 4, we present the architecture of our model and also show its
relation to matrix factorization. We then present a comprehensive empirical
study to support our claims in Section 5. Finally, we conclude and suggest future
work.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>There has been extensive study of recommendation systems, with a myriad of
publications. However, the exploration of deep neural networks for recommender
systems has received relatively little attention. In this section, we review
a representative set of approaches that are related to our proposed approach.</p>
      <p>
        <bold>Common Approaches for Recommendation.</bold>
        Recommendation systems can in general be divided into collaborative
filtering and content-based recommendation. In a narrower sense, in collaborative
filtering based recommendation, an item is recommended to a user if similar
users liked that item. Collaborative filtering can be further divided into user
collaborative filtering, item collaborative filtering, or a hybrid of both.
Examples of such techniques include Bayesian matrix
factorization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], matrix completion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Restricted Boltzmann Machine [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
nearest neighbour modelling [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] etc. In user collaborative methods such as [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the
algorithm first computes the similarity between users based on the items
they liked. The score of a user-item pair is then computed by combining
the scores given to that item by similar users. Item-based collaborative filtering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
computes similarity between items based on the users who like both items. It
then recommends items to the user based on the items she has previously liked.
Finally, in user-item based collaborative filtering, both the users and the items
are projected into a common vector space based on the user-item matrix and
then the item and user representations are combined to produce a recommendation.
Matrix factorization based approaches like [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are examples of such a
technique. One of the major drawbacks of collaborative filtering is its inability
to handle new users and new items, a problem which is often referred to as the
cold-start issue.
      </p>
      <p>
        Another common approach for recommendation is content-based
recommendation. In this approach, features are extracted from the user's profile and/or the item's content
and are used for recommending items to users. The
underlying assumption is that users tend to like items similar to those they
already like. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], each user is modeled by a distribution over news topics that
is constructed from articles she liked with a prior distribution of topic
preference computed using all users who share the same location. A major advantage
of using content-based recommendation is that it can handle the problem of
item cold-start as it uses item features for recommendation. For user cold-start,
a variety of other features like age, location, popularity aspects could be used.
In the following, we discuss recommendation works which use neural networks.
      </p>
      <p>
        <bold>Neural Network based Recommendation.</bold>
        Early pioneering work using neural networks was done in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where a
two-layer Restricted Boltzmann Machine (RBM) was used to model users' explicit
ratings on items. The work has been later extended to model the ordinal nature
of ratings [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Recently autoencoders have become a popular choice for building
recommendation systems [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ][
        <xref ref-type="bibr" rid="ref23">23</xref>
        ][
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The idea of user-based AutoRec [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is to
learn hidden structures that can reconstruct a user’s ratings given her
historical ratings as inputs. In terms of user personalization, this approach shares a
similar spirit with the item-item model [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that represents a user by her rated
item features. While previous work has lent support to addressing collaborative
filtering, most of it has focused on explicit ratings and modeled the
observed data only. As a result, such models can easily fail to learn users' preferences from
positive-only implicit data.
      </p>
      <p>
        The works most relevant to ours are [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] a
collaborative denoising autoencoder (CDAE) for CF with implicit feedback is presented.
In contrast to the DAE-based CF [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], CDAE additionally plugs a user node
to the input of autoencoders for reconstructing the user’s ratings. As shown by
the authors, CDAE is equivalent to the SVD++ model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] when the identity
function is applied to activate the hidden layers of CDAE. Although CDAE is a
collaborative filtering model, it is solely based on item-item interaction whereas
the work which we present here is based on user-item interaction. On the other
hand in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], authors have explored deep neural networks for recommender
systems. They present a general framework named NCF, short for Neural
Collaborative Filtering that replaces the inner product with a neural architecture that can
learn an arbitrary function from the given data. It uses a multi-layer perceptron
to learn the user-item interaction function. NCF is able to express and generalize
matrix factorization. They then combine the linearity of matrix factorization and
non-linearity of deep neural networks for modelling user-item latent structures.
They call this model NeuMF, short for Neural Matrix Factorization.
      </p>
      <p>
        <bold>User-Item Projection.</bold>
        Since our work is based on user-item collaborative filtering, we need to
project users and items to a common latent space in order to capture their
similarity and recommend items to users accordingly. One of the most effective
approaches in projecting queries and documents into a common low-dimensional
space has been shown in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The model is named the Deep Structured Semantic
Model (DSSM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] which is effective in calculating the relevance of the document
given a query by computing the distance between them. Originally this model
was meant for the purpose of ranking, but since the problem of ranking has very
close associations with that of recommendation, DSSM was later extended to
recommendation scenarios in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors used DSSM for
recommendation where the first neural network contains user’s query history (and thus
referred to as user view) and the second neural network contains implicit
feedback of items. The resulting model is named multi-view DNN (MV-DNN) since
it can incorporate item information from more than one domain and then jointly
optimize all of them using the same loss function in DSSM. However, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the
features for the users were their search queries and features for items came from
multiple sources (e.g., Apps, Movies/TV). This makes it less adaptable for a
news website, as it requires a lot of information outside the news domain.
      </p>
      <p>
        For many of the approaches in recommendation systems, the objective is to
minimize the root mean squared error on the user-item matrix reconstruction.
However, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] it has been shown that a ranking-based objective function is more
effective in generating relevant recommendations.
      </p>
    </sec>
    <sec id="sec-5">
      <title>MODEL ARCHITECTURE</title>
      <p>We first briefly review DSSM and then we provide the description of our model.
We then show the relationship between matrix factorization and our
approach.</p>
      <p>Fig. 1: Recurrent Attention DSSM Model Architecture</p>
      <p>
        <bold>Deep Structured Semantic Model.</bold>
        The Deep Structured Semantic Model (DSSM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was proposed for the purpose
of ranking. Essentially, DSSM can be viewed as a multi-view learning model
that typically consists of two or more neural networks, one for each individual view.
In the original two-view DSSM model, the network on the left side was meant
for query representation, whereas the networks on the right side were meant
for representing the documents. The input to these networks can be of any
arbitrary type, such as the letter-trigrams in the original paper or the bag of unigrams used in
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. After that, each input vector goes through a non-linear transformation in the
feedforward neural network to output an embedding vector, which is smaller in
size than the input vector. The learning objective of the DSSM is to maximize the
cosine similarity between the two output vectors. For the purpose of training, a
set of positive examples and randomly sampled negative examples are generated
in order to minimize the cosine loss on positive examples. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], authors used
DSSM for recommendation where the first neural network contained the query
history of users and the second neural network contained the implicit feedback of
items (e.g., news clicks, app downloads). The resulting model is named
multi-view DNN (MV-DNN) since it can incorporate item information from more than
one domain and jointly optimize them using the same loss function as DSSM.
      </p>
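      <p>
        The training signal described above can be sketched with a simple cosine-similarity check; the vectors below are toy outputs of the two views, not trained embeddings.
      </p>

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, the relevance measure DSSM computes
    between the outputs of the two views."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy output embeddings from the two views (made-up numbers).
user_vec = np.array([0.2, 0.9, -0.1])
pos_item = np.array([0.25, 0.85, 0.0])   # clicked article
neg_item = np.array([-0.7, 0.1, 0.6])    # randomly sampled negative

# Training pushes the positive pair's similarity above the negatives'.
print(cosine(user_vec, pos_item) > cosine(user_vec, neg_item))  # True
```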
      <p><bold>Recurrent Attention DSSM (RA-DSSM).</bold>
In the MV-DNN, the input to the user view was merely the query history of
users. In this work, we modify the way in which inputs are sent to the user
view in order to adapt it specifically for the case of news recommendation. One
of the major issues in news recommendation is that of changing user interests.
Interests of users can be classified into short term as well as long term interests.
Hence, it becomes crucial for a news recommender to identify these interests and
recommend accordingly.</p>
      <p>
        LSTMs have been shown to be capable of learning long-term dependencies [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ][
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
Bidirectional LSTMs, on the other hand, can capture past and future information
effectively. Users' interests keep changing over time, and at the time of
recommendation we need to know both the current and the long-term interests of the user.
Using bidirectional LSTMs as an encoder helps us to identify interests which
the user has taken up recently (short-term) as well as the long-term interests of the
user. For each user, we have the sequence in which news articles were read by
her. We then choose the first R read articles for each user and use it as inputs
to our bidirectional LSTMs. The forward state updates of the LSTM satisfy the
following equations
      </p>
      <p>
        [→f_t; →i_t; →o_t] = σ(→W[→h_{t−1}; →r_t] + →b)   (1)
      </p>
      <p>
        →l_t = tanh(→V[→h_{t−1}; →r_t] + →d)   (2)
      </p>
      <p>
        →c_t = →f_t ⊙ →c_{t−1} + →i_t ⊙ →l_t   (3)
      </p>
      <p>
        →h_t = →o_t ⊙ tanh(→c_t)   (4)
      </p>
      <p>Here σ is the logistic sigmoid function and ⊙ denotes element-wise multiplication; →f_t, →i_t and →o_t represent the forget, input
and output gates respectively. →r_t denotes the input at time t, →h_t denotes
the latent state, and →b and →d represent the bias terms. The forget, input and
output gates control the flow of information throughout the sequence. →W and
→V are matrices representing the weights associated with the connections.
The backward states (←h_1, ←h_2, …, ←h_R) are computed in a similar manner.
The amount of reading history used as input to the bidirectional LSTM is denoted by R.
We then concatenate the forward and backward states to obtain the annotations
(h_1, h_2, …, h_R), where h_i = [→h_i; ←h_i]   (5)</p>
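      <p>
        The forward updates above can be sketched in NumPy as follows. The hidden size, input size and random weights are illustrative stand-ins; a real implementation would use a deep learning library with learned parameters.
      </p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, r_t, W, V, b, d):
    """One forward LSTM update: gates from W[h_{t-1}; r_t] + b,
    candidate from V[h_{t-1}; r_t] + d, then the cell and state updates."""
    z = np.concatenate([h_prev, r_t])
    gates = sigmoid(W @ z + b)          # stacked [f_t; i_t; o_t]
    H = h_prev.shape[0]
    f_t, i_t, o_t = gates[:H], gates[H:2 * H], gates[2 * H:]
    l_t = np.tanh(V @ z + d)            # candidate cell state
    c_t = f_t * c_prev + i_t * l_t      # element-wise gate mixing
    h_t = o_t * np.tanh(c_t)            # new latent state
    return h_t, c_t

rng = np.random.default_rng(0)
H, D, R = 4, 3, 5                       # hidden size, input size, history length
W = rng.normal(size=(3 * H, H + D)); b = np.zeros(3 * H)
V = rng.normal(size=(H, H + D));     d = np.zeros(H)

# Run the forward pass over a toy reading history of R article embeddings.
h, c = np.zeros(H), np.zeros(H)
forward_states = []
for t in range(R):
    r_t = rng.normal(size=D)            # stand-in for a news embedding
    h, c = lstm_step(h, c, r_t, W, V, b, d)
    forward_states.append(h)
print(len(forward_states), forward_states[0].shape)  # 5 (4,)
```

The backward states would be obtained by running the same loop over the reversed history with a second set of weights.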
      <p>
        We then need to identify the extent/granularity of each interests. Recently
in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], the effectiveness of attention mechanisms has been shown for the task of
neural machine translation. The goal of the attention mechanism in such tasks
is to derive a context vector that captures relevant source side information to
help predict the current target word. In our case, we want to use the sequence
of annotations generated by the encoder to come up with a context vector that
captures the extent of the user’s interests. Though, in a typical RNN
encoderdecoder framework [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], a context vector is generated at each time step to predict
the target word, in our case, we only need to calculate the context vector for a
single time step:
c_attention = Σ_{j=1}^{R} α_j h_j   (6)
      </p>
      <p>where h_1, …, h_R represents the sequence of annotations to which the encoder
maps the sequence of read news articles, and each α_j represents the
weight corresponding to annotation h_j. The user view (left view) of the
model can be seen in Figure 1. The input to this view is a selected amount of reading
history of each user. Each r_i in the figure is a news embedding of dimension 300.</p>
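      <p>
        A minimal sketch of this context-vector computation follows. The text specifies only that each annotation h_j receives a weight α_j; the scoring vector w used here to produce those weights is a hypothetical stand-in for the learned attention parameters.
      </p>

```python
import numpy as np

def attention_context(H_ann, w):
    """Weighted sum of annotations h_1..h_R. Scores come from a
    hypothetical learned vector w; the softmax yields weights alpha_j
    that sum to 1, and the context vector is sum_j alpha_j * h_j."""
    scores = H_ann @ w                       # one score per annotation
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                   # softmax weights alpha_j
    return alphas, alphas @ H_ann            # c_attention

rng = np.random.default_rng(1)
R, H = 5, 8                                  # history length, annotation size
H_ann = rng.normal(size=(R, H))              # annotations from the BiLSTM
w = rng.normal(size=H)

alphas, c_att = attention_context(H_ann, w)
print(alphas.sum().round(6), c_att.shape)    # 1.0 (8,)
```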
      <p>
        However, the right view of the DSSM remains the same as can be seen in
Figure 1. For inputs to the right view of the DSSM, we select one positive sample
i.e., an article that has been read by the user (apart from those that were used as
input to the user view), and n randomly selected negative samples (articles that
have not been read by the user). Each item+ and item− used as input to the item
view is also an embedding of size 300.
Typically in matrix factorization, to learn the model parameters, existing
pointwise methods [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ][
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] perform regression with a squared loss. This is based on
the assumption that observations are generated from a Gaussian distribution.
However, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] it has been shown that such a method does not align well with
implicit data. Also, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] it has been shown that a
ranking-based objective function is more suitable for the task of recommendation.
Keeping these two aspects in mind, we adapt the loss function used in DSSM
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We first compute the posterior probability of a clicked news item given a
user from the relevance score using a softmax function
      </p>
      <p>P(item+ | u) = exp(R(u, item+)) / Σ_{item′} exp(R(u, item′))   (7)</p>
      <p>where u denotes the user, item+ denotes the item that was clicked by the user,
and R represents the inner product function. We then maximize the likelihood
of the clicked news items given the user with the following loss function:</p>
      <p>L(Θ) = −log ∏_{(u, item+)} P(item+ | u)   (8)</p>
      <p>where Θ represents the parameters of our model.</p>
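      <p>
        For a single user with one clicked item and four sampled negatives (made-up relevance scores), the posterior and the per-pair loss can be computed as follows.
      </p>

```python
import numpy as np

def posterior(scores, pos_idx):
    """Softmax posterior of the clicked item over one positive and
    n negative relevance scores R(u, item)."""
    e = np.exp(scores - scores.max())
    return e[pos_idx] / e.sum()

# One user: relevance scores for the clicked item and 4 negatives.
scores = np.array([2.0, 0.1, -0.5, 0.3, 0.0])  # index 0 is item+
p_pos = posterior(scores, 0)

# The contribution of this (u, item+) pair to the loss is -log P(item+ | u);
# summing over all pairs gives the full objective.
loss = -np.log(p_pos)
print(p_pos > 0.5, loss > 0)  # True True
```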
      <p><bold>Relation with Matrix Factorization.</bold>
We now show how our model can be interpreted as a special case of matrix
factorization, which is one of the most popular models for recommendation and
has been investigated extensively in the literature.</p>
      <p>Matrix factorization models map both users and items to a joint latent factor
space of dimensionality f, such that user-item interactions are modeled as inner
products in that space. Accordingly, each item i is associated with a vector
q_i ∈ R^f and each user u is associated with a vector p_u ∈ R^f. For a given item i,
the elements of q_i measure the extent to which the item possesses those factors,
positive or negative. For a given user u, the elements of p_u measure the extent of
interest the user has in items that are high on the corresponding factors, again,
positive or negative. The resulting dot product of the two vectors captures the
interaction between the user u and item i. This approximates the user u's rating
for the item i, denoted by r_ui, leading to the estimate</p>
      <p>r̂_ui = q_i^T p_u   (9)</p>
      <p>The major challenge here is to compute q_i, p_u ∈ R^f. We solve this problem
by using deep neural networks. The deep neural architecture allows us to learn
a non-linear mapping of the users and the items to the same latent space. For
computing the mapping for the users, we first use a recurrent network followed
by an attention layer. Fully connected layers are then used to bring the
users and the items into the same latent space. In the final layer of the DSSM, we
compute the similarity between the user and the item using the dot product of
the non-linear mappings of the input vectors. The user can then be represented
as φ(u) and the item as φ(i) (here φ represents the learnt
non-linear mapping). Finally, we estimate the rating as</p>
      <p>r̂_ui = φ(i)^T φ(u)   (10)</p>
      <p>
        Although in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the authors resorted to learning an arbitrary function to compute this
similarity, we learn a non-linear transformation and then use the dot
product to compute the similarity.
      </p>
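      <p>
        A sketch of this scoring scheme follows, with small random fully connected mappings standing in for the learnt function φ; all dimensions and weights are illustrative.
      </p>

```python
import numpy as np

def phi(x, W1, W2):
    """Toy stand-in for the learnt non-linear mapping phi: two fully
    connected layers with tanh, projecting into the shared latent space."""
    return np.tanh(W2 @ np.tanh(W1 @ x))

rng = np.random.default_rng(2)
D_u, D_i, F = 16, 12, 8                  # user/item feature sizes, latent dim
Wu1, Wu2 = rng.normal(size=(10, D_u)), rng.normal(size=(F, 10))
Wi1, Wi2 = rng.normal(size=(10, D_i)), rng.normal(size=(F, 10))

u_feat = rng.normal(size=D_u)            # e.g. the attention output for a user
items = rng.normal(size=(5, D_i))        # candidate news embeddings

u_latent = phi(u_feat, Wu1, Wu2)
# Rank candidates by the dot product of the two non-linear mappings.
scores = np.array([phi(i, Wi1, Wi2) @ u_latent for i in items])
ranking = np.argsort(-scores)            # best candidate first
print(scores.shape, len(ranking))  # (5,) 5
```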
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTS</title>
      <p>We conduct experiments to answer the following questions:
1. Does our proposed model outperform the state-of-the-art implicit
collaborative filtering methods? Also, how do the different variations of our model perform
on the given task?
2. How does our proposed model work for solving the item cold-start problem?
3. How does our proposed model work for solving the user cold-start problem?</p>
      <p>
        <bold>DATASET.</bold>
        For this work we use the dataset published by CLEF NewsREEL 2017. CLEF
NewsREEL provides an interaction platform to compare different news
recommender systems' performance in an online as well as an offline setting [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. As a
part of their evaluation for offline setting, CLEF shared a dataset which
captures interactions between users and news stories. It includes interactions of eight
different publishing sites in the month of February, 2016. The recorded stream of
events include 2 million notifications, 58 thousand item updates, and 168 million
recommendation requests. The dataset also provides other information like the
title and text of each news article, time of publication etc. Each user can be
identified by a unique id. For our task, we find out the sequence in which the
articles were read by the users. Along with this we also find out the content of
each of these read articles. Since we rely only on implicit feedback, we only need
to know whether the article was read by a user or not.
      </p>
      <p>
        <bold>Experimental Settings.</bold>
        As mentioned earlier, we use the dataset provided by CLEF NewsREEL 2017.
We extract the sequence in which the articles were read by the users. For each
article we concatenate the title and the text and use gensim [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to learn doc2vec
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] embeddings for those. The size of the embeddings is set to 300. In the given
dataset, almost 77% of the users have read fewer than 3 articles. We choose users
who have read between 10 and 15 articles (inclusive) for training and testing our
model for item recommendation. The frequency of users who have read more
than 15 articles varies extensively, and hence we restrict ourselves to the upper
bound of 15. We then choose users who have read 2-4 articles for testing our
model on the user cold-start problem. For the item cold-start problem, we again
test on users who have read between 10 and 15 articles. We ensure that the
chronology of the data is kept intact.
      </p>
      <p>
        Evaluation Protocol: To evaluate the performance of the recommended
items we use the leave-one-out evaluation strategy, which has been widely adopted
in the literature [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
        ][
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For each user, we hold out her latest interaction as the
test set and use the remaining data for training. Since it is time consuming
to rank all items for every user during evaluation, we follow the common
strategy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ][
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] that randomly samples 100 items with which the user has not interacted, and
ranks the test item among these 100 items. The performance of a ranked
list is judged by Hit Ratio (HR) and Normalized Discounted Cumulative Gain
(NDCG) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Unless otherwise stated, we truncate the ranked list at 10 for
both metrics. Intuitively, HR@k measures whether the test item is
present in the top-k list, while NDCG accounts for the position of the hit by
assigning higher scores to hits at top ranks. We calculate both metrics for each
test user and report the average score.
      </p>
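The leave-one-out protocol above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `score_fn` is a hypothetical callable returning the model's score for an item, and the toy scores below are arbitrary.

```python
import math
import random

def hit_ratio_and_ndcg(score_fn, test_item, negatives, k=10):
    """Rank the held-out test item among the sampled negatives and
    compute HR@k and NDCG@k for this single test case."""
    candidates = negatives + [test_item]
    ranked = sorted(candidates, key=score_fn, reverse=True)  # highest score first
    rank = ranked.index(test_item)  # 0-based position of the hit
    hr = 1.0 if rank < k else 0.0
    # With a single relevant item, NDCG@k reduces to 1 / log2(rank + 2).
    ndcg = 1.0 / math.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg

# Toy usage with hypothetical scores for 100 negatives plus one test item.
random.seed(0)
scores = {i: random.random() for i in range(101)}
scores[42] = 1.5  # give the held-out item the highest score
negatives = [i for i in range(101) if i != 42]
hr, ndcg = hit_ratio_and_ndcg(scores.get, 42, negatives, k=10)
# hr == 1.0 and ndcg == 1.0, since the test item is ranked first
```

Averaging these per-user values over all test users yields the reported HR@10 and NDCG@10.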
      <p>
        Baselines: We compare our proposed approach with the following methods:
– ItemPop. News articles are ranked by their popularity, as judged by their
number of interactions. This is a non-personalized baseline used to benchmark
recommendation performance [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
– BPR [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This method optimizes a matrix factorization model with a
pairwise ranking loss, which is tailored to learning from implicit feedback. We
report the best performance obtained by varying the learning rate.
– eALS [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This is a state-of-the-art matrix factorization method for item
recommendation. It optimizes the squared loss (between actual and predicted
item ratings), treating all unobserved interactions as negative
instances and weighting them non-uniformly by item popularity.
– NeuMF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is a state-of-the-art neural matrix factorization model. It
treats the problem of generating recommendation using implicit feedback as
a binary classification problem. Consequently, it uses the binary cross-entropy
loss to optimize its model parameters.
      </p>
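For reference, the pairwise objective that BPR optimizes for a single (user, positive item, negative item) triple can be written out directly. This is a sketch of the standard BPR loss; the scores below are placeholders, not outputs of any of the compared models.

```python
import math

def bpr_loss(score_pos, score_neg):
    """BPR pairwise loss for one training triple:
    -ln sigmoid(score_pos - score_neg)."""
    diff = score_pos - score_neg
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss is small when the observed (positive) item outscores the
# sampled negative, and grows when the ordering is wrong.
low = bpr_loss(2.0, 0.0)
high = bpr_loss(0.0, 2.0)
# low < high; at equal scores the loss is ln 2
```

Minimizing this loss pushes observed items above unobserved ones in the ranking, which is why BPR is a natural fit for implicit feedback.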
      <p>For all the above methods we choose the number of predictive factors that
maximizes the performance on our dataset.</p>
      <p>
        Our proposed method is based on modelling the user-item relationship; hence, we
mainly compare it with other user-item models. We leave out the comparison
with other models like SLIM [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and CDAE [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] because these are item-item
models, and hence any performance difference may be caused by the user models
used for personalization.
      </p>
      <p>
        Parameter Settings: For all the conducted experiments we use an Intel
i7-6700 CPU @ 3.40GHz with 32GB of RAM and a Tesla K40c GPU.
We ran all our experiments on the GPU. We implemented our proposed method
using Keras [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. As mentioned earlier, for each user who had read between
10 and 15 (inclusive) articles, we held out the last read article for our test set. We
then construct our training set as follows:
1. We first define the reading history that we want to use. We denote the
reading history by Rh.
2. For each user, we use Rh read articles as inputs to the user view.
      </p>
      <p>Leaving the last read article out, the remaining articles are used as positive
samples for the item view (right view) of the model.
3. For each positive instance of a user, we randomly sample n negative
instances (news items that the user has not interacted with), which are used as
inputs for the item view of the model. We experimentally set the number of
negative instances n to 4.</p>
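The training-set construction above can be sketched as follows. This is illustrative only: the function and variable names are ours, and Rh and n are set arbitrarily for the toy example.

```python
import random

def build_training_samples(user_histories, all_items, rh=8, n_neg=4, seed=0):
    """Illustrative construction of the training set: the first `rh` reads
    form the user view, the remaining reads (minus the held-out last one)
    are positives, and each positive is paired with `n_neg` randomly
    sampled unread items as negatives."""
    rng = random.Random(seed)
    samples = []  # tuples of (user, history, item, label)
    for user, reads in user_histories.items():
        history = reads[:rh]
        positives = reads[rh:-1]  # the last read article is held out for testing
        unread = sorted(all_items - set(reads))
        for pos in positives:
            samples.append((user, history, pos, 1))
            for neg in rng.sample(unread, n_neg):
                samples.append((user, history, neg, 0))
    return samples

histories = {"u1": list(range(12))}  # a user who read 12 articles (ids 0-11)
samples = build_training_samples(histories, set(range(100)), rh=8, n_neg=4)
# 3 positives (articles 8, 9, 10), each paired with 4 negatives -> 15 samples
```

In the real pipeline each item id would be replaced by its 300-dimensional doc2vec embedding before being fed to the user and item views.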
      <p>
        We then randomly divide the training set into training and validation sets in a
4:1 ratio, ensuring that the two sets do not overlap. We tuned the
hyper-parameters of our model using the validation set. The model and all its
variants are learnt by optimizing the log loss of Equation 8. We initialise the fully
connected network weights from the uniform distribution on the range between
−√(6/(fan_in + fan_out)) and +√(6/(fan_in + fan_out)) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. We used a batch size
of 256 and adadelta [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as a gradient-based optimizer for learning the
parameters of the model. Also, it is worth noting that, just as in the case of
NeuMF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where the size of the last layer of the deep network determines the
number of predictive factors, we can likewise treat the size of the last layer of our
network (just before computing the similarity) as the number of predictive
factors used.
      </p>
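The weight initialization described above, from [28], can be made concrete as follows (a pure-Python sketch; in practice Keras provides this as its glorot_uniform initializer, and the 128-unit layer size here is a hypothetical example, not a value from the paper):

```python
import math
import random

def glorot_uniform(fan_in, fan_out, seed=0):
    """Glorot/Xavier uniform initialization: weights drawn from
    U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

# e.g. a 300-d doc2vec input feeding a hypothetical 128-unit hidden layer
weights = glorot_uniform(300, 128)
limit = math.sqrt(6.0 / (300 + 128))
# every entry of the 300 x 128 matrix lies in [-limit, +limit]
```

Scaling the range by the fan-in and fan-out keeps the variance of activations roughly constant across layers at the start of training.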
      <p>[Figures 2-7: plots of the evaluation metrics against K, including the Cold News and Cold User settings, were reduced to stray axis values during extraction and are omitted here.]</p>
      <p>Performance Comparison
Figure 2 and Figure 3 show the performance of our model when varying the amount
of reading history used as input for the user side of RA-DSSM. Overall, we see
that as we increase the amount of reading history used, the performance also
increases. This shows that a user has multiple interests which slowly get captured
as the number of articles used for the user view of RA-DSSM is increased.</p>
      <p>A user’s interests develop and vary with time, and hence we also
experimented with concatenating the time at which each article was read by the user
with the article embeddings, using these as inputs to the model. It was
observed that there was no significant change in the performance. One of the
prime reasons for this could be that the model is able to encode the aspect of
time into itself given its sequential nature.</p>
      <p>Figure 4 and Figure 5 show the performance of the Top-K recommended
lists, where the ranking position K ranges from 1 to 10. We leave out the variants
of our own model here for comparison and only use the best performing model,
i.e., RA-DSSM. As can be clearly seen from the figures, our model
shows consistent improvements over the other methods across all positions. The
reason for this can be attributed to the fact that, apart from accounting for the
user’s general preferences, we also account for the user’s changing interests and
the extent of those interests, which the baselines do not incorporate directly.
We observe major improvements in the NDCG scores of our model. There is
an approximate 22% improvement over NeuMF. The reason for this is the loss
function of Equation 8 used by our model. The loss function which is optimized
for ranking, helps the model to recommend a better ranked list of items. For
baseline methods we see that eALS outperforms BPR with a margin of 2%. We
also note that ItemPop performs worst which indicates the need for modelling
user’s personalized preferences.</p>
      <p>We then evaluated our model for the cold start cases, as can be seen in Figure
6 and Figure 7. For this task we segregated users whose last read article was a new
one, i.e. an article that had never been seen by any user before they read it. We
found that the number of such users was 74. Out of these 74 users, at
HR@10 we observe that around 35% of the time we were able to recommend that
article. This suggests that our model is well suited for handling the item
cold-start problem. For user cold-start, we test our learned model on users who
had read between 2 and 4 (inclusive) articles. The HR@10 score was around
50%. We see a gradual increase in the hit rate as we increase the value of k.
These results demonstrate the ability of our model to handle the user
cold start problem as well.</p>
      <p>
        We then note the effect on our model of varying the kind of recurrent
network used. We tested our model with LSTMs, GRUs (Gated Recurrent Units)
[
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] and vanilla RNNs. From Figure 8 and Figure 9, the trend in performance
can be observed as follows: LSTM &gt; GRU &gt; RNN. One of the reasons for this
could be that an LSTM or a GRU is better able to encode the interests
of the user. In Table 1, we report the performance after adding bidirectional units
and an attention layer to the LSTM. We observe that Attention BiLSTM &gt; BiLSTM
&gt; LSTM. We also note that attention does indeed enable us to capture the
extent of the user's interests, as it performs slightly better than the bidirectional LSTM.
      </p>
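The role of the attention layer, weighting the recurrent hidden states by their relevance before pooling them into a single user representation, can be sketched as follows (an illustrative softmax-weighted sum, not the paper's exact parameterization; the scores would come from a learned scoring function):

```python
import math

def attention_pool(hidden_states, scores):
    """Softmax the relevance scores, then return the attention-weighted
    sum of the recurrent hidden states."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]

# Three 2-d hidden states; the second one receives almost all the attention.
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.1, 5.0, 0.1])
# pooled is dominated by the second state, i.e. close to [0, 1]
```

This weighting is what lets the model emphasize the articles most indicative of a user's current interests rather than treating all reads equally.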
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this work we used deep neural networks for news recommendation. We
combined user-item collaborative filtering with the content of the read news articles
to build our model. We tackled the problem of the changing and diverse
reading interests of users using a recurrent network combined with neural
attention. We also showed the effectiveness of our model in solving the
user cold-start and item cold-start problems. We further showed the effectiveness
of our model when using one-hot item encodings, which demonstrates the adaptability
of our model to other recommendation scenarios that rely purely on implicit
feedback.</p>
      <p>In the future, we would like to study the effect of learning an arbitrary function
instead of using the inner product to calculate the similarity between the user and
the item. We would also like to evaluate our model in different recommendation
scenarios. Apart from this, we would like to explore the idea of reinforcement
learning for news recommendation, where the implicit feedback provided by users could
be used to model their interests and recommend articles to them.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>He</surname>
          </string-name>
          , Xiangnan and Liao, Lizi and Zhang, Hanwang and Nie, Liqiang and Hu, Xia and Chua, Tat-Seng:
          <article-title>Neural Collaborative Filtering</article-title>
          .
          <source>In Proceedings of the 26th International Conference on World Wide Web, WWW '17</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andriy</given-names>
            <surname>Mnih</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Bayesian probabilistic matrix factorization using Markov chain Monte Carlo</article-title>
          .
          <source>In Proceedings of the 25th international conference on Machine learning. ACM</source>
          ,
          <volume>880</volume>
          -
          <fpage>887</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Jasson D. M.</given-names>
            <surname>Rennie</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nathan</given-names>
            <surname>Srebro</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Fast maximum margin matrix factorization for collaborative prediction</article-title>
          .
          <source>In Proceedings of the 22nd international conference on Machine learning. ACM</source>
          ,
          <volume>713</volume>
          -
          <fpage>719</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , Andriy Mnih, and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Restricted Boltzmann machines for collaborative filtering</article-title>
          .
          <source>In Proceedings of the 24th international conference on Machine learning. ACM</source>
          ,
          <volume>791</volume>
          -
          <fpage>798</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Robert M.</given-names>
            <surname>Bell</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Improved neighborhood-based collaborative filtering</article-title>
          .
          <source>In KDD cup and workshop at the 13th ACM SIGKDD international conference on knowledge discovery and data mining. Citeseer</source>
          ,
          <volume>7</volume>
          -
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Badrul</given-names>
            <surname>Sarwar</surname>
          </string-name>
          , George Karypis, Joseph Konstan,
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Item-based collaborative filtering recommendation algorithms</article-title>
          .
          <source>In Proceedings of the 10th international conference on World Wide Web. ACM</source>
          ,
          <volume>285</volume>
          -
          <fpage>295</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Jiahui</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Dolan</surname>
          </string-name>
          , and Elin Rønby Pedersen.
          <year>2010</year>
          .
          <article-title>Personalized news recommendation based on click behavior</article-title>
          .
          <source>In Proceedings of the 15th international conference on Intelligent user interfaces. ACM</source>
          ,
          <volume>31</volume>
          -
          <fpage>40</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Po-Sen</given-names>
            <surname>Huang</surname>
          </string-name>
          , Xiaodong He, Jianfeng Gao, Li Deng,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Acero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Larry</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Learning deep structured semantic models for web search using clickthrough data</article-title>
          .
          <source>In Proceedings of the 22nd ACM international conference on Conference on information &amp; knowledge management. ACM</source>
          ,
          <volume>2333</volume>
          -
          <fpage>2338</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Ali Mamdouh</given-names>
            <surname>Elkahky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaodong</given-names>
            <surname>He</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A multi-view deep learning approach for cross domain user modeling in recommendation systems</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web. ACM</source>
          ,
          <volume>278</volume>
          -
          <fpage>288</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Joonseok</given-names>
            <surname>Lee</surname>
          </string-name>
          , Samy Bengio, Seungyeon Kim, Guy Lebanon, and
          <string-name>
            <given-names>Yoram</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Local collaborative ranking</article-title>
          .
          <source>In Proceedings of the 23rd international conference on World wide web. ACM</source>
          ,
          <volume>85</volume>
          -
          <fpage>96</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Yehuda</given-names>
            <surname>Koren</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Factorization meets the neighborhood: a multifaceted collaborative filtering model</article-title>
          .
          <source>In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM</source>
          ,
          <volume>426</volume>
          -
          <fpage>434</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA</source>
          , Valletta, Malta,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . http://is.muni.cz/publication/ 884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Xiangnan</given-names>
            <surname>He</surname>
          </string-name>
          , Hanwang Zhang,
          <string-name>
            <surname>Min-Yen Kan</surname>
          </string-name>
          , and
          <string-name>
            <surname>Tat-Seng Chua</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Fast matrix factorization for online recommendation with implicit feedback</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <volume>549</volume>
          -
          <fpage>558</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Steffen</given-names>
            <surname>Rendle</surname>
          </string-name>
          , Christoph Freudenthaler, Zeno Gantner, and
          <string-name>
            <given-names>Lars</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>BPR: Bayesian personalized ranking from implicit feedback</article-title>
          .
          <source>In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence</source>
          . AUAI Press,
          <fpage>452</fpage>
          -
          <lpage>461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Xiangnan</given-names>
            <surname>He</surname>
          </string-name>
          , Tao Chen, Min-Yen Kan, and
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Trirank: Reviewaware explainable recommendation by modeling aspects</article-title>
          .
          <source>In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM</source>
          ,
          <volume>1661</volume>
          -
          <fpage>1670</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andriy</given-names>
            <surname>Mnih</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Probabilistic Matrix Factorization.</article-title>
          .
          <source>In Nips</source>
          , Vol.
          <volume>1</volume>
          . 2-
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Matthew D.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ADADELTA: an adaptive learning rate method</article-title>
          .
          <source>arXiv preprint arXiv:1212.5701</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Yao</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>DuBois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alice X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Ester</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Collaborative denoising auto-encoders for top-n recommender systems</article-title>
          .
          <source>In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM</source>
          ,
          <volume>153</volume>
          -
          <fpage>162</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Florian</given-names>
            <surname>Strub</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jeremie</given-names>
            <surname>Mary</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Collaborative Filtering with Stacked Denoising AutoEncoders and Sparse Inputs</article-title>
          .
          <source>In NIPS Workshop on Machine Learning for eCommerce.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. François Chollet and others.
          <year>2015</year>
          . Keras. https://github.com/fchollet/keras. (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Xia</given-names>
            <surname>Ning</surname>
          </string-name>
          and
          <string-name>
            <given-names>George</given-names>
            <surname>Karypis</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Slim: Sparse linear methods for top-n recommender systems</article-title>
          .
          <source>In Data Mining (ICDM)</source>
          ,
          <source>2011 IEEE 11th International Conference on. IEEE</source>
          ,
          <fpage>497</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <given-names>Dinh Q.</given-names>
            <surname>Phung</surname>
          </string-name>
          , Svetha Venkatesh, and others.
          <year>2009</year>
          .
          <article-title>Ordinal Boltzmann machines for collaborative filtering</article-title>
          .
          <source>In Proceedings of the Twenty-fifth Conference on Uncertainty in Artificial Intelligence</source>
          . AUAI Press,
          <fpage>548</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Suvash</given-names>
            <surname>Sedhain</surname>
          </string-name>
          , Aditya Krishna Menon,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Sanner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lexing</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>AutoRec: Autoencoders meet collaborative filtering</article-title>
          .
          <source>In Proceedings of the 24th International Conference on World Wide Web. ACM</source>
          ,
          <volume>111</volume>
          -
          <fpage>112</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>Minmin</given-names>
            <surname>Chen</surname>
          </string-name>
          , Zhixiang Xu,
          <string-name>
            <given-names>Fei</given-names>
            <surname>Sha</surname>
          </string-name>
          , and
          <string-name>
            <surname>Kilian Q Weinberger</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Marginalized Denoising Autoencoders for Domain Adaptation</article-title>
          .
          <source>In Proceedings of the 29th International Conference on Machine Learning (ICML-12)</source>
          .
          <fpage>767</fpage>
          -
          <lpage>774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>Immanuel</given-names>
            <surname>Bayer</surname>
          </string-name>
          , Xiangnan He,
          <string-name>
            <surname>Bhargav Kanagal</surname>
          </string-name>
          , and Steffen Rendle.
          <year>2017</year>
          .
          <article-title>A Generic Coordinate Descent Framework for Learning from Implicit Feedback</article-title>
          .
          <source>In Proceedings of the 26th International Conference on World Wide Web (WWW '17).</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>Quoc</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Distributed representations of sentences and documents</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Machine Learning (ICML-14)</source>
          .
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>Xavier</given-names>
            <surname>Glorot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          .
          <source>In Aistats</source>
          , Vol.
          <volume>9</volume>
          .
          <fpage>249</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          <volume>9</volume>
          ,
          <issue>8</issue>
          (
          <year>1997</year>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and
          <string-name>
            <given-names>Quoc V</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          .
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <given-names>Mike</given-names>
            <surname>Schuster</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kuldip K</given-names>
            <surname>Paliwal</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Bidirectional recurrent neural networks</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          <volume>45</volume>
          ,
          <issue>11</issue>
          (
          <year>1997</year>
          ),
          <fpage>2673</fpage>
          -
          <lpage>2681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <given-names>Frank</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          , Torben Brodt, Jonas Seiler, Benjamin Kille, Andreas Lommatzsch, Martha Larson, Roberto Turrin, and
          <string-name>
            <given-names>András</given-names>
            <surname>Serény</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Benchmarking news recommendations: The CLEF NewsREEL use case</article-title>
          .
          <source>In ACM SIGIR Forum</source>
          , Vol.
          <volume>49</volume>
          . ACM,
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <given-names>Junyoung</given-names>
            <surname>Chung</surname>
          </string-name>
          , Caglar Gulcehre, KyungHyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          .
          <source>arXiv preprint arXiv:1412.3555</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>