RARE: A Recurrent Attentive Recommendation Engine for News Aggregators

Dhruv Khattar, Vaibhav Kumar∗, Shashank Gupta, Manish Gupta†, Vasudeva Varma
International Institute of Information Technology Hyderabad
{dhruv.khattar, vaibhav.kumar, shashank.gupta}@research.iiit.ac.in

∗ Author had equal contribution. He can also be contacted at vaibhav2@andrew.cmu.edu.
† Author is also a Principal Applied Researcher at Microsoft.
Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

With news stories coming from a variety of sources, it is crucial for news aggregators to present interesting articles to the user in order to maximize their engagement. This creates the need for a news recommendation system which understands the content of the articles as well as accounts for the users' preferences. Methods such as Collaborative Filtering, which are well known for general recommendations, are not suitable for news because of the short life span of articles and the large number of articles published each day. Apart from this, such methods do not harness the information present in the sequence in which the articles are read by the user and hence are unable to account for the specific and generic interests of the user, which may keep changing with time. In order to address these issues for news recommendation, we propose the Recurrent Attentive Recommendation Engine (RARE). RARE consists of two components and utilizes distributed representations of news articles. The first component models the user's sequential news-reading behaviour in order to understand her general interests, i.e., to get a summary of her interests. The second component utilizes an article-level attention mechanism to understand her specific interests. We feed the information obtained from both components to a Siamese network in order to make predictions which pertain to the user's generic as well as specific interests. We carry out extensive experiments over three real-world datasets and show that RARE outperforms the state-of-the-art. Furthermore, we also demonstrate the effectiveness of our method in handling cold-start cases.

1 Introduction

A news aggregator collects news from a variety of sources and presents it to the user. It would be quite cumbersome for a user to select articles of her choice from a huge list of presented articles which may pertain to a variety of subjects. Hence, it becomes crucial for such aggregators to have a recommendation system to point the user to the most relevant items, and thus maximize her engagement with the site and minimize the time needed to find relevant content.

A popular approach to the task of recommendation is collaborative filtering Bell and Koren (2007); Rennie and Srebro (2005); Salakhutdinov et al. (2007), which uses the user's past interactions with items to predict the most relevant content. Another common approach is content-based recommendation, which extracts features of items and/or users and recommends new items to users based on the similarity between the features. Amongst the various approaches to collaborative filtering, Matrix Factorization (MF) Koren (2008) is the most popular one; it projects users and items into a shared latent space, using a vector of latent features to represent a user or an item. Thereafter, a user's interaction with an item is modelled as the inner product of their latent vectors.
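As a concrete illustration of the MF scoring just described, the following is a minimal NumPy sketch; the factor dimensions and the random factors are purely illustrative, not trained values.

```python
import numpy as np

# Minimal sketch of matrix-factorization scoring: users and items live in a
# shared latent space, and relevance is the inner product of their vectors.
num_users, num_items, k = 1000, 500, 32   # illustrative sizes
rng = np.random.default_rng(0)
P = rng.normal(size=(num_users, k))       # user latent factors
Q = rng.normal(size=(num_items, k))       # item latent factors

def mf_score(u, i):
    """Predicted preference of user u for item i (inner product)."""
    return P[u] @ Q[i]

# Rank all items for user 0 by predicted score and take the top 10.
top10 = np.argsort(P[0] @ Q.T)[::-1][:10]
```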
However, Collaborative Filtering methods are not suitable for news recommendation, because news articles have a short life span and expire quickly Zhong et al. (2015). Such methods also require a considerable number of interactions with an item (article) before making predictions, which is not desirable for news recommendation, because we would ideally want to start recommending articles as soon as they are published. Also, they do not directly harness the information present in the sequence in which the articles were read by the user and hence fail to account for the generic as well as specific interests of the user, which may keep changing with time. In order to address these issues, it becomes crucial to understand the content of the news articles as well as the user's preferences. We explain this through an example in the following paragraph.

As can be seen from Fig. 1(A), if a user reads four different articles belonging to tennis and football, then we would like our model to infer that the generic interests of the user lie in reading articles about sports. This would allow articles belonging to different topics in the sports category to be recommended to the user. However, since the user reads more articles on tennis than on football, we would like to give more weight to the articles related to tennis, as can be seen in Fig. 1(B). Hence, in our overall list of recommended articles, we would like to present news articles related to sports, amongst which articles related to tennis would be given more importance. It may also happen that the user suddenly starts reading articles related to business rather than sports. In such a case we would want to start recommending articles related to business as well. This can be seen in Fig. 1(C). It is important to note that in all these cases the sequential reading history of the user is very important while generating recommendations.

Figure 1: In (A), the user's sequence is used to model her general interests, while in (B), the user's specific interests are captured. In (C), the changing interests of the user are modelled. In all these cases, the sequential reading history of a user plays an important role. Different colors represent the different topics of the articles.

To encode this intuition, we propose a novel neural network framework, namely the Recurrent Attentive Recommendation Engine (RARE). As illustrated in Fig. 3, RARE consists of two components. The first component is based on a recurrent neural network and uses the sequential reading history of the user as its input. We call this the generic encoder.
This helps us to identify the generic/overall interests of the user, i.e., it provides a summary of the user's interests. The second component utilizes a recurrent neural network with an attention mechanism to identify the specific interests of the user. We call this the specific encoder. The attention mechanism allows the model to attend to articles in a differential manner, discriminating the more important ones from the less important ones. We then concatenate the representations obtained from both these components and call the result the unified representation of the user's interests. Limiting the size of the user reading history used as input to both these components allows us to adapt to changing user preferences. We then feed this unified representation, along with the representation of the candidate article, to a Siamese network and compute an element-wise product between the outputs obtained at the final layer of the sister networks, as illustrated in Fig. 2. Finally, we use a logistic unit to compute the score for recommendation. Using such a network enhances the model with further non-linearity and enables it to capture the user-article interaction better. It also allows the model to learn an arbitrary similarity function instead of relying on traditional similarity metrics. The distributed representation of each news article is used as input to our model. This gives us the capability to recommend articles as and when they are produced, without depending on any prior user interaction with those articles.

To summarize, the main contributions of this work are as follows.

• We present a neural network based architecture (RARE) with the following capabilities.
  – It utilizes the content of the news articles, giving it the ability to recommend articles as soon as they are published.
  – It takes into account the users' generic as well as specific interests.
  – It adapts to the changing interests of the user.
• We carry out extensive experiments over three real-world datasets to show the effectiveness of our model. The results reveal that our method outperforms the state-of-the-art.
• We show the effectiveness of our model in solving the cold-start cases as well.

2 Related Work

There has been extensive study on recommendation systems, with a myriad of publications. In this section, we review a representative set of approaches.

2.1 Common Approaches for Recommendation Systems

Recommendation systems in general can be divided into collaborative recommendation systems and content-based recommendation systems. In collaborative filtering based recommendation, an item is recommended to a user if similar users liked that item. Collaborative filtering can be further divided into user collaborative filtering, item collaborative filtering, or a hybrid of both. Examples of such techniques include Bayesian matrix factorization Salakhutdinov and Mnih (2008), matrix completion Rennie and Srebro (2005), Restricted Boltzmann Machines Salakhutdinov et al. (2007), and nearest-neighbour modelling Bell and Koren (2007). In user collaborative methods such as Bell and Koren (2007), the algorithm first computes the similarity between every pair of users based on the items liked by them; the score of a user-item pair is then computed by combining the scores given to that item by similar users. Item-based collaborative filtering Sarwar et al. (2001) computes the similarity between items based on the users who like both items; it then recommends items to the user based on the items she has previously liked. Finally, in user-item based collaborative filtering, both the users and the items are projected into a common vector space based on the user-item matrix, and the item and user representations are then combined to produce a recommendation. Matrix factorization based approaches like Rennie and Srebro (2005) and Salakhutdinov and Mnih (2008) are examples of such a technique. One of the major drawbacks of collaborative filtering is its inability to handle new users and new items, a problem which is often referred to as the cold-start issue.
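As a minimal illustration of the user collaborative filtering scheme just described (a toy example of ours, not code from any of the cited systems), one can compute cosine similarities between users' interaction vectors and score items by similarity-weighted votes:

```python
import numpy as np

# Toy binary interaction matrix: R[u, i] = 1 if user u liked item i.
R = np.array([[1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

# Cosine similarity between every pair of users.
unit = R / np.linalg.norm(R, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, 0.0)           # ignore self-similarity

# Score of item i for user u: similarity-weighted votes of the other users.
scores = sim @ R
scores[R > 0] = -np.inf              # do not re-recommend items already liked
top_item_per_user = scores.argmax(axis=1)
```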
Another common approach for recommendation is content-based recommendation. In this approach, features from a user's profile and/or an item's description are extracted and used for recommending items to users. The underlying assumption is that users tend to like items similar to those they liked previously. In Liu et al. (2010), each user is modeled by a distribution over news topics that is constructed from the articles she liked, with a prior distribution of topic preferences computed using all users who share the same location. A major advantage of content-based recommendation is that it can handle the problem of item cold-start, as it uses item features for recommendation. For user cold-start, a variety of other features like age, location, and popularity aspects could be used. In the following, we discuss previous work on neural approaches for recommendation systems.

Figure 2: RARE Model Architecture

2.2 Neural Recommendation Systems

Early work which used neural networks Salakhutdinov et al. (2007) used a two-layer Restricted Boltzmann Machine (RBM) to model users' explicit ratings on items. This work was later extended to model the ordinal nature of ratings Phung et al. (2009). Recently, auto-encoders have become a popular choice for building recommendation systems Chen et al. (2012); Sedhain et al. (2015); Strub and Mary (2015). The idea of user-based AutoRec Sedhain et al. (2015) is to learn hidden structures that can reconstruct a user's ratings given her historical ratings as inputs. In terms of user personalization, this approach shares a similar spirit with the item-item model Ning and Karypis (2011); Sarwar et al. (2001), which represents a user in terms of her rated item features. While previous work has lent support to addressing collaborative filtering, most of it has focused on observed ratings and modeled the observed data only. As a result, such models can easily fail to learn users' preferences from positive-only implicit data.

In Wu et al. (2016), a collaborative denoising auto-encoder (CDAE) for CF with implicit feedback is presented. In contrast to the DAE-based CF of Strub and Mary (2015), CDAE additionally plugs a user node into the input of the auto-encoder for reconstructing the user's ratings. As shown by the authors, CDAE is equivalent to the SVD++ model Koren (2008) when the identity function is used to activate the hidden layers of CDAE. Although CDAE is a collaborative filtering model, it is solely based on item-item interaction, whereas the work we present here is based on user-item interaction. On the other hand, in He et al. (2017), the authors explored deep neural networks for recommendation systems. They present a general framework named NCF, short for Neural Collaborative Filtering, that replaces the inner product with a neural architecture that can learn an arbitrary function from the given data. It uses a multi-layer perceptron to learn the user-item interaction function. NCF is able to express and generalize matrix factorization. They then combine the linearity of matrix factorization and the non-linearity of deep neural networks for modelling user-item latent structures. They call this model NeuMF, short for Neural Matrix Factorization.
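To make the NCF idea concrete, here is a minimal sketch (ours, not the authors' code) of replacing the inner product with an MLP over concatenated user and item embeddings, using the Keras functional API; all sizes are illustrative.

```python
from tensorflow.keras import layers, Model

num_users, num_items, dim = 10000, 5000, 32     # illustrative sizes
user_in = layers.Input(shape=(1,), dtype="int32")
item_in = layers.Input(shape=(1,), dtype="int32")

u = layers.Flatten()(layers.Embedding(num_users, dim)(user_in))
v = layers.Flatten()(layers.Embedding(num_items, dim)(item_in))

# The MLP learns the user-item interaction function instead of a dot product.
h = layers.Concatenate()([u, v])
for units in (64, 32):
    h = layers.Dense(units, activation="relu")(h)
score = layers.Dense(1, activation="sigmoid")(h)   # relevance probability

ncf = Model([user_in, item_in], score)
ncf.compile(optimizer="adam", loss="binary_crossentropy")
```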
Since our work also involves projecting articles and users into a common geometric space, we review the work in Huang et al. (2013). The authors propose an effective approach for projecting queries and documents into a common low-dimensional space. The model is named the Deep Structured Semantic Model (DSSM) Huang et al. (2013) and is effective in calculating the relevance of a document given a query by computing the distance between them. Originally this model was meant for the purpose of ranking, but since the problem of ranking has very close associations with that of recommendation, DSSM was later extended to recommendation scenarios in Elkahky et al. (2015). In Elkahky et al. (2015), the authors designed a DSSM such that the first neural network contains the user's query history (and is thus referred to as the user view) and the second neural network contains implicit feedback on items. The resulting model is named multi-view DNN (MV-DNN), since it can incorporate item information from more than one domain and then jointly optimize all of them using the same loss function as DSSM. However, in Elkahky et al. (2015), the features for the users were their search queries and the features for items came from multiple sources (e.g., Apps, Movies/TV, etc.). This makes the approach less adaptable for a news website, as it requires a lot of information outside the news domain. Viewed in its entirety, however, the work suggests that supercharging a neural network with non-linearities to project a user and an item into the same geometric space is very effective in calculating relevance. We draw the inspiration for using a Siamese network in our model on similar grounds.

3 Model Architecture

In this section we first introduce the news article recommendation task and then provide an elaborate description of the various components of the proposed RARE model.

3.1 Task Description

Given a series of news articles read by the user, our task is to recommend articles of interest to the user. The implicit feedback provided by the user is available to us, i.e., we have information about the articles clicked by the user. Apart from this, we also have the content of the news articles at our disposal. We first select a reading history of size R for each user. The size of the reading history determines the number of past interactions we use for making predictions. The articles previously read by a user can be represented as [r_1, r_2, ..., r_t, ..., r_R], where 1 ≤ t ≤ R. Using this list as input to our model, we need to recommend a ranked list of articles which are aligned with the user's interests.

3.2 RARE Overview

We propose a novel Recurrent Attentive Recommendation Engine (RARE) to address the problem of news recommendation for news aggregators. An overview of our method can be seen in Fig. 2. The basic idea of RARE is to build a unified representation of a user's interests which encapsulates both her specific and generic interests. Apart from this, using a fixed amount of reading history per user provides RARE with the flexibility to adapt to the changing interests of the user. The pipeline of RARE can be described as follows.

• We first learn a distributed representation for each news article by combining its title and text.
• We then fix a reading history size R, and use the representations of the previous R articles read by the user as inputs to the model.
• We come up with a unified representation of the user's interests using recurrent neural networks with an attention mechanism.
• Treating the unified representation of the user as a query and the representation of the candidate article as a document, we use a Siamese network to make them undergo similar transformations and supercharge them with non-linearities to discover user-item interactions.

3.3 Distributed Representation for News Articles

We learn a 300-dimensional distributed representation Le and Mikolov (2014) for each news article by combining the title and text of the news article. Learning such a representation allows us to
• capture the overall semantics of the news article, and
• come up with a representation for new news articles as well as for articles of varying lengths.

News articles generally follow an inverted pyramid structure, where the title and the first paragraph give away the desired information. Hence, we only choose the title and the first paragraph, because they usually contain all the relevant information without delving into detailed explanations. We also experimented with using the entire news article, but found better results with just the title and the first paragraph.
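The sketch below illustrates how such a representation could be learned with gensim's Doc2Vec (gensim 4.x API), an implementation of Le and Mikolov (2014); the toy corpus and all hyper-parameters other than the 300-dimensional vector size are ours, not the paper's.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each article is its title plus first paragraph.
articles = {
    "a1": "Federer wins the title. The final lasted five sets ...",
    "a2": "Stocks rally at the open. Tech shares led the gains ...",
}
corpus = [TaggedDocument(words=text.lower().split(), tags=[aid])
          for aid, text in articles.items()]

# 300-dimensional paragraph vectors, as in Section 3.3.
model = Doc2Vec(vector_size=300, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

vec_a1 = model.dv["a1"]  # stored vector of a training article
# A freshly published article gets a vector without any user interaction:
vec_new = model.infer_vector("election results announced today".split())
```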
Figure 3: Two Components of RARE: Generic Encoder and Specific Encoder

3.4 Generic Encoder

The inputs to the generic encoder are the representations of the articles previously read by the user. Fig. 3(a) shows the graphical model of the network used to identify generic interests in RARE. We use a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells. LSTMs have been shown to be capable of learning long-term dependencies Hochreiter and Schmidhuber (1997); Sutskever et al. (2014). The aim of this component is to understand the generic (broader/overall) interests of the user. The last hidden state of the RNN, i.e., h_t, encapsulates this information, which we represent as c^g. We can think of the final hidden state as the overall summary of the user's interests.

The state updates of the LSTM satisfy the following equations:

    f_t = \sigma(W_f [h_{t-1}, r_t] + b_f)    (1)
    i_t = \sigma(W_i [h_{t-1}, r_t] + b_i)    (2)
    o_t = \sigma(W_o [h_{t-1}, r_t] + b_o)    (3)
    l_t = \tanh(V [h_{t-1}, r_t] + d)         (4)
    c_t = f_t \cdot c_{t-1} + i_t \cdot l_t   (5)
    h_t = o_t \cdot \tanh(c_t)                (6)

Here \sigma is the logistic sigmoid function; f_t, i_t, and o_t represent the forget, input, and output gates respectively; r_t denotes the input at time t, and h_t denotes the latent state. W_f, W_i, W_o, and V represent the weight parameters, while b_f, b_i, b_o, and d represent the bias parameters. The forget, input, and output gates control the flow of information throughout the sequence.

3.5 Specific Encoder

The architecture of the specific encoder is similar to that of the generic encoder; its graphical representation can be seen in Fig. 3(b). We use LSTM cells here as well. To capture the specific interests of the user, i.e., to understand the deeper interests of the user within her broader interests, we use an article-level attention mechanism. This provides us with a context vector which encapsulates the specific interests of the user. It can be represented as

    c^s = \sum_{j=1}^{R} \alpha_j h_j    (7)

where the attention weights \alpha_j control the parts of the input sequence which should be emphasized or ignored, and h_j stands for the output of the hidden units. This attention mechanism gives RARE the capability to adaptively focus more on the important items.
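A minimal sketch of the two encoders (Eqs. 1-7) in the Keras functional API follows; the layer sizes and the single-layer attention scorer are illustrative assumptions of ours, not the paper's reported configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

R, d_article, d_hidden = 12, 300, 128         # illustrative sizes
history = layers.Input(shape=(R, d_article))  # [r_1, ..., r_R]

# Generic encoder: the last LSTM state summarizes overall interests (c^g).
c_g = layers.LSTM(d_hidden)(history)

# Specific encoder: article-level attention over the per-step outputs h_j.
h_seq = layers.LSTM(d_hidden, return_sequences=True)(history)
e = layers.Dense(1)(h_seq)                    # unnormalized score per article
alpha = layers.Softmax(axis=1)(e)             # attention weights alpha_j
c_s = layers.Lambda(                          # c^s = sum_j alpha_j * h_j
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, h_seq])

c_u = layers.Concatenate()([c_g, c_s])        # unified representation (Eq. 8)
```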
3.6 RARE

The complete architecture of the proposed model can be seen in Fig. 4. The outputs obtained from the specific and the generic encoders are concatenated and then used as inputs to a Siamese network along with the candidate article.

Figure 4: Complete Architecture of the RARE System

For the given task, the generic encoder captures the overall interests of the user, i.e., it captures a summary of all the news articles read by the user. At the same time, the specific encoder adaptively selects the important articles to capture the specific interests of the user. Hence, to take advantage of both kinds of information, we concatenate the outputs of both encoders. As shown in Fig. 3, h_t^g is incorporated into c^u to provide the summarized user interests. Note that different encoding mechanisms are invoked in the two encoders when they are trained jointly: the last hidden state of the generic encoder, h_t^g, plays a different role from that of h_t^s. The former has the responsibility of encoding the information present in the sequence in which the articles were read by the user, while the latter is used for computing attention weights. The information obtained from both encoders is utilized to come up with a unified representation of the user's interests:

    c^u = [c^g; c^s] = [h_t^g; \sum_{j=1}^{R} \alpha_j h_j^s]    (8)

where c^u represents the unified representation of the user's interests.

We then use c^u as input to one of the sister networks in the Siamese network, as shown in Fig. 4. The input to the other sister network is the learned representation of the candidate article. The Siamese network supercharges RARE with further non-linearities and makes the user representation and the article representation go through similar transformations. In Huang et al. (2013), an architecture similar to that of a Siamese network has been used for ranking documents with respect to a query with great effectiveness. If we draw a parallel between the query-document problem and our task, one can see that the query in our case is c^u and the document is the representation of the candidate news article. Hence, it seems apt to use such a network if we are to project both of these into the same geometric space to uncover the underlying user-article interaction pattern. A similar technique has also been used by the authors in He et al. (2017) for modelling user-item interactions. Final predictions are obtained from the Siamese network by applying the logistic unit to the element-wise product between the outputs obtained from the sister networks.

Rather than using a Siamese network, the other choice was to use a typical encoder-decoder framework. However, a typical encoder-decoder framework is unable to produce out-of-vocabulary (OOV) words. In the news recommendation setting, each newly published article that has not been interacted with by any user would act as an "OOV word". Since it is crucial for a news recommender to recommend articles as soon as they are published, we resort to the Siamese approach, as it allows us to handle such cases well.
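The following is a minimal sketch of the scoring head under our own assumptions: both representations are first projected to a common size, the 128→64 sister layers share weights, and the logistic unit is applied to their element-wise product. The exact layer configuration is not fully specified in the text, so treat this as illustrative rather than the definitive implementation.

```python
from tensorflow.keras import layers, Model

d_user, d_article = 256, 300                   # illustrative sizes
user_in = layers.Input(shape=(d_user,))        # c^u from the two encoders
art_in = layers.Input(shape=(d_article,))      # candidate article vector

# Project both inputs to a common size, then apply shared sister layers.
proj_u = layers.Dense(128, activation="relu")
proj_a = layers.Dense(128, activation="relu")
shared = layers.Dense(64, activation="relu")   # weights shared by both towers

u = shared(proj_u(user_in))
a = shared(proj_a(art_in))

# Logistic unit on the element-wise product gives y_hat in [0, 1].
score = layers.Dense(1, activation="sigmoid")(layers.Multiply()([u, a]))
siamese_head = Model([user_in, art_in], score)
```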
3.7 Learning

Typically, to learn the model parameters, existing point-wise methods Salakhutdinov and Mnih (2007) perform regression with a squared loss. This is based on the assumption that observations are generated from a Gaussian distribution. However, in He et al. (2017) it has been shown that such a method is not very effective when only implicit data is available.

Given a user u and an article x, let \hat{y}_{ux} represent the predicted score at the output layer. Training is performed by minimizing the point-wise loss between \hat{y}_{ux} and its target value y_{ux}. Considering the one-class nature of implicit feedback, we can view the value of y_{ux} as a label: 1 meaning the item x is relevant to user u, and 0 otherwise. The prediction score \hat{y}_{ux} then represents how likely item x is to be relevant to u. Hence, in order to constrain the values between 0 and 1, we use the logistic function. We then define the likelihood function as follows:

    p(\gamma^+, \gamma^- | I, \Theta_m) = \prod_{(u,i) \in \gamma^+} \hat{y}_{ui} \prod_{(u,j) \in \gamma^-} (1 - \hat{y}_{uj})    (9)

where \gamma^+ and \gamma^- represent the positive (observed interactions) and negative (unobserved interactions) articles respectively, I represents the input, and \Theta_m represents the parameters of the model. The negative log likelihood can then be written as follows (after rearranging the terms):

    L = - \sum_{(u,i) \in \gamma^+ \cup \gamma^-} [ y_{ui} \log \hat{y}_{ui} + (1 - y_{ui}) \log(1 - \hat{y}_{ui}) ]    (10)

The loss is similar to binary cross-entropy and can be minimized using gradient descent methods. It is also worth noticing that the likelihood function is such that it simultaneously adjusts the model's parameters by maximizing the scores of the relevant articles and minimizing the scores of the non-relevant articles. This is similar to what is done while ranking documents corresponding to a query in Huang et al. (2013). Using such a likelihood also gives us the advantages of a ranking function.
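To make Eq. 10 concrete, here is a small NumPy sketch of the point-wise objective over observed positives and sampled negatives; the 1:4 sampling ratio and the scores are our illustration, not the paper's settings.

```python
import numpy as np

def point_wise_loss(y, y_hat, eps=1e-8):
    """Negative log likelihood of Eq. 10 (binary cross-entropy)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# One observed interaction (y = 1) and four sampled unobserved ones (y = 0).
y     = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.4, 0.1, 0.3])   # model scores for these pairs
print(point_wise_loss(y, y_hat))  # decreases as positives are scored higher
```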
4 Experiments

In this section, we describe the datasets, the state-of-the-art methods, and the evaluation protocol, along with the settings used for learning the parameters of the model.

4.1 Dataset

We use three real-world datasets for evaluation. First, we use the dataset published by CLEF NewsREEL 2017 Hopfgartner et al. (2016). CLEF shared a dataset which captures interactions between users and news stories. It includes interactions of eight different publishing sites in the month of February 2016. The recorded stream of events includes 2 million notifications, 58 thousand item updates, and 168 million recommendation requests. It also includes information like the title and text of each news article. For this dataset we considered all the users who had read more than 10 articles, which leaves a total of 22229 users. The other two datasets are provided by a popular news aggregation website (name omitted for review). The second dataset contains a list of articles read by 10297 users in an Indian language, Malayalam. The third dataset contains a list of articles read by 22848 users in Indonesian. We make the code publicly available at https://github.com/dhruvkhattar/RARE.

4.2 Baselines

We compare our proposed approach with the following methods.

• ItemPop. News articles are ranked by their popularity, judged by their number of interactions. This is a non-personalized method to benchmark the recommendation performance Rendle et al. (2009).
• BPR Rendle et al. (2009). This method uses matrix factorization with a pairwise ranking loss, which is tailored to learn to rank from implicit feedback. We report the best performance obtained by varying the learning rate.
• eALS He et al. (2016). This is a state-of-the-art matrix factorization method for item recommendation. It optimizes the squared loss (between actual item ratings and predicted ratings), treats all unobserved interactions as negative instances, and weights them non-uniformly by item popularity.
• NeuMF He et al. (2017). This is a state-of-the-art neural matrix factorization model. It treats the problem of generating recommendations using implicit feedback as a binary classification problem. Consequently, it uses the binary cross-entropy loss to optimize its model parameters.

Our method is based on user-item interactions; hence we mainly compare it with other user-item models. We leave out the comparison with models like SLIM Ning and Karypis (2011) and CDAE Wu et al. (2016) because these are item-item models, and hence performance differences may be caused by the user models for personalization.

4.3 Evaluation Protocol

To evaluate the performance of the recommended items we use the leave-one-out evaluation strategy, which has been widely adopted in the literature Bayer et al. (2017); He et al. (2016); Rendle et al. (2009). For each user, we held out her latest interaction as the test instance and utilized the remaining data for training. Since it is time-consuming to rank all items for every user during evaluation, we followed the popular strategy Elkahky et al. (2015); Koren (2008) that randomly samples 100 items that the user has not interacted with, ranking the test item among these 100 items. The performance of a ranked list is judged by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) He et al. (2015). We truncated the ranked list at 10 for both metrics. As such, HR@k intuitively measures whether the test item is present in the top-k list, and NDCG accounts for the position of the hit by assigning higher scores to hits at top ranks. We calculated both metrics for each test user and report the average score.
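This protocol can be summarized in a few lines; the sketch below (our illustration, with random scores standing in for model outputs) ranks the held-out item among 100 sampled negatives and computes HR@10 and NDCG@10 for one test user.

```python
import numpy as np

def hr_ndcg_at_k(scores, test_pos, k=10):
    """scores: model scores for the test item plus 100 sampled negatives;
    test_pos: index of the held-out test item within that array."""
    rank = int((scores > scores[test_pos]).sum())  # 0-based rank of test item
    hr = 1.0 if rank < k else 0.0
    ndcg = 1.0 / np.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg

scores = np.random.rand(101)        # stand-in for the model's 101 scores
print(hr_ndcg_at_k(scores, test_pos=0))
```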
4.4 Parameter Learning

We use an Intel i7-6700 CPU @ 3.40GHz with 32GB of RAM and a Tesla K40c GPU. We implemented our proposed method using Keras Chollet et al. (2015). We randomly divide the labeled set into a training and a validation set in a 4:1 ratio, and tune the hyper-parameters of our model using the validation set. The proposed model and all its variants are learned by optimizing the log likelihood given by Eq. 10. We initialize the fully connected network weights with the uniform distribution in the range between -\sqrt{6/(fan_{in} + fan_{out})} and \sqrt{6/(fan_{in} + fan_{out})} Glorot and Bengio (2010). We used a batch size of 256 and AdaDelta Zeiler (2012) as the optimizer.

5 Results and Analysis

In this section we present the results obtained by carrying out different experiments with our method.

5.1 Performance Comparison with Baselines

For MF based methods like BPR and eALS, the number of predictive factors chosen is equal to the number of latent factors; we report the best performance in this case. For NeuMF, we vary the size of the CF layers (also latent factors) to choose the best fit for our model.

In Figs. 5 to 7, we compare our method with the baselines. Note that the performance of the ItemPop measure was very weak and hence it does not show up clearly in the graphs. Top-K recommended lists are used, where K varies from 1 to 10. It is very clear from Figs. 5 and 7 that RARE outperforms the other methods by a significant margin across all positions on the NewsREEL and the Malayalam datasets respectively. RARE outperforms the other methods on the Indonesian dataset as well (Fig. 6), but the margin is not as large. Amongst the different baselines, the trend in the performance can be seen as follows: NeuMF > eALS > BPR (in terms of both HR and NDCG). Although in Rendle et al. (2009) it has been shown that BPR can be a strong performer for ranking owing to its pairwise ranking aware learner, we did not see this trend on our datasets. RARE also outperforms all the other baselines in terms of NDCG.

Figure 5: Performance (HR@K and NDCG@K for K = 1 to 10) of RARE vs the state-of-the-art on CLEF NewsREEL

Figure 6: Performance (HR@K and NDCG@K for K = 1 to 10) of RARE vs the state-of-the-art on the Indonesian dataset

Figure 7: Performance (HR@K and NDCG@K for K = 1 to 10) of RARE vs the state-of-the-art on the Malayalam dataset

5.2 Effect of Size of Reading History

We vary the size of the reading history R used as input to our model. From Fig. 8, one can see that the Hit Ratio slowly increases with the size of the reading history up to a certain point, after which it decreases. However, the NDCG keeps on increasing. We can attribute this behaviour to the fact that users have diversified reading interests, which only get effectively captured after a substantial number of interactions have been observed. However, after a while, increasing the user history often leads to over-specialization, where the generic interests tend to overpower the specific ones. This is also an indicator of the fact that the preferences of a user keep varying, and hence a window size should be chosen such that it helps the model dynamically adapt to the user's changing behaviour.

Figure 8: Performance (HR@10 and NDCG@10) of RARE w.r.t. the user's reading history size on NewsREEL

For all our methods, we chose a reading history of 12 for the users. We needed to make a choice between 12 and 14, and we chose 12 because we gave more importance to HR than to NDCG.
5.3 Effect of Different Encoders

We first note the effect on RARE of varying the kind of recurrent unit used. We tested our model using LSTMs, GRUs (Gated Recurrent Units) Chung et al. (2014), and vanilla RNNs. From Fig. 9, the trend in the performance can be observed as follows: LSTM > GRU > RNN, although the differences are not very large. One of the reasons for this could be the fact that an LSTM or a GRU is better able to encode the interests of the user, as they handle long-term dependencies better.

Figure 9: Performance of RARE w.r.t. the recurrent unit used in RARE on NewsREEL

We also note the effects of using different variants of our own model, i.e., of replacing the unified representation in RARE with solely the specific or the generic encoder. The results can be seen in Table 1. We note the trend in performance as follows: RARE > Generic Encoder > Specific Encoder. This indicates that merely identifying the users' generic interests (a summary of overall interests) is not sufficient for learning a good recommendation model. However, when we use a combination of both in RARE, the recommendation performance improves, which clearly indicates that identifying both the specific and the generic interests is essential for better recommendations.

Method                     HR@10   NDCG@10
Specific Encoder           0.916   0.657
Generic Encoder            0.920   0.664
Specific+Generic (RARE)    0.934   0.671

Table 1: Performance using different encoding mechanisms on CLEF NewsREEL

5.4 Performance on Cold Start Cases

We then evaluated our model on the cold-start cases, as can be seen in Fig. 10. For this task we segregated users who had read a new news article at the end of their history, i.e., they read articles which had never been seen by anyone before they read them. We found that the number of such users was 74 in the CLEF dataset; there were very few such users in the other two datasets. Over these 74 users, we see that the HR@10 is around 0.35. This suggests that our model is well suited to handling the item cold-start problem.

For user cold-start, we test our learned model on users who had read between 2 and 4 articles (inclusive) in the same dataset. Since we set the history size to 12, we had to set the remaining inputs to zeros. The HR@10 score was around 0.5, and we see a gradual increase in the hit rate as we increase the value of K. These results indicate the effectiveness of our model in handling the problem of user cold-start as well. Although this is not exactly the user cold-start problem, because it still considers some number of user interactions, the performance is still worth noting, because the baselines need a considerable amount of user history before making predictions. In our method, on the other hand, we can simply use the trained model to recommend articles to users who have had very few interactions.

Figure 10: Performance (HR@K and NDCG@K) of our model on cold-start cases

5.5 Effect of Varying Layers

We observe the performance of our model when we vary the number of layers used in the Siamese network. We experiment by varying the number of layers along with the number of hidden units: one layer of size 128, two layers of sizes 128 and 64, and three layers of sizes 128, 64, and 32. From Table 2, we can see that the best performance is observed in the second case.

Layers         HR@10   NDCG@10
128            0.913   0.659
128→64         0.934   0.671
128→64→32      0.912   0.666

Table 2: Performance of RARE with different numbers of dense layers

6 Conclusion

In this paper, we proposed the Recurrent Attentive Recommendation Engine (RARE) to address the problem of news recommendation. We attempt to encode both the generic and the specific interests of the users. For the former we use a recurrent neural network, while for the latter we use a recurrent network with an attention mechanism. We use the unified representations obtained from both of these, along with a Siamese network, to make predictions. We conducted extensive experiments on three real-world datasets and demonstrated that our method can outperform the state-of-the-art methods in terms of different evaluation metrics.

References
Immanuel Bayer, Xiangnan He, Bhargav Kanagal, and Steffen Rendle. 2017. A Generic Coordinate Descent Framework for Learning from Implicit Feedback. In WWW.

Robert M. Bell and Yehuda Koren. 2007. Improved Neighborhood-based Collaborative Filtering. In KDD. 7–14.

Minmin Chen, Zhixiang Xu, Fei Sha, and Kilian Q. Weinberger. 2012. Marginalized Denoising Autoencoders for Domain Adaptation. In ICML. 767–774.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555.

Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems. In WWW. 278–288.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the Difficulty of Training Deep Feed-forward Neural Networks. In AISTATS, Vol. 9. 249–256.

Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. TriRank: Review-aware Explainable Recommendation by Modeling Aspects. In CIKM. 1661–1670.

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW.

Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. In SIGIR. 549–558.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.

Frank Hopfgartner, Torben Brodt, Jonas Seiler, Benjamin Kille, Andreas Lommatzsch, Martha Larson, Roberto Turrin, and András Serény. 2016. Benchmarking News Recommendations: The CLEF NewsREEL Use Case. In ACM SIGIR Forum, Vol. 49. 129–136.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In CIKM. 2333–2338.

Yehuda Koren. 2008. Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model. In KDD. 426–434.

Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In ICML. 1188–1196.

Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized News Recommendation Based on Click Behavior. In IUI. 31–40.

Xia Ning and George Karypis. 2011. SLIM: Sparse Linear Methods for Top-N Recommender Systems. In ICDM. 497–506.

Dinh Q. Phung, Svetha Venkatesh, et al. 2009. Ordinal Boltzmann Machines for Collaborative Filtering. In UAI. 548–556.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In UAI. 452–461.

Jasson D. M. Rennie and Nathan Srebro. 2005. Fast Maximum Margin Matrix Factorization for Collaborative Prediction. In ICML. 713–719.

Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In NIPS, Vol. 1.

Ruslan Salakhutdinov and Andriy Mnih. 2008. Bayesian Probabilistic Matrix Factorization using Markov Chain Monte Carlo. In ICML. 880–887.

Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann Machines for Collaborative Filtering. In ICML. 791–798.

Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based Collaborative Filtering Recommendation Algorithms. In WWW. 285–295.

Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders Meet Collaborative Filtering. In WWW. 111–112.

Florian Strub and Jeremie Mary. 2015. Collaborative Filtering with Stacked Denoising AutoEncoders and Sparse Inputs. In NIPS Workshop on Machine Learning for eCommerce.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS. 3104–3112.

Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In WSDM. 153–162.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701.

Erheng Zhong, Nathan Liu, Yue Shi, and Suju Rajan. 2015. Building Discriminative User Profiles for Large-scale Content Recommendation. In KDD. 2277–2286.