=Paper=
{{Paper
|id=Vol-2079/paper10
|storemode=property
|title=Neural Content-Collaborative Filtering for News Recommendation
|pdfUrl=https://ceur-ws.org/Vol-2079/paper10.pdf
|volume=Vol-2079
|authors=Dhruv Khattar,Vaibhav Kumar,Manish Gupta,Vasudeva Varma
|dblpUrl=https://dblp.org/rec/conf/ecir/KhattarK0V18
}}
==Neural Content-Collaborative Filtering for News Recommendation==
Neural Content-Collaborative Filtering for News Recommendation

Dhruv Khattar, Vaibhav Kumar∗, Manish Gupta†, Vasudeva Varma
Information Retrieval and Extraction Laboratory
International Institute of Information Technology Hyderabad
dhruv.khattar, vaibhav.kumar@research.iiit.ac.in, manish.gupta, vv@iiit.ac.in

∗ The authors had equal contribution.
† The author is also an applied researcher at Microsoft.

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

Abstract

Popular methods like collaborative filtering and content-based filtering have their own disadvantages. The former requires a considerable amount of user data before making predictions, while the latter suffers from over-specialization. In this work, we address both of these issues with a hybrid, neural-network-based approach to news recommendation. The hybrid approach incorporates both (1) the user-item interaction and (2) the content of the articles read by the user in the past. We first build an article-embedding based profile for each user. We then use this user profile, together with adequate positive and negative samples, to train the neural network based model. The resulting model is then applied on a real-world dataset. We compare it with a set of established baselines, and the experimental results show that our model outperforms the state-of-the-art.

1 Introduction

A popular approach to the task of recommendation is collaborative filtering (CF) (Bel07)(Ren05)(Sal07), which uses the user's past interaction with items to predict the most relevant content. Amongst the various approaches to collaborative filtering, matrix factorization (MF) (Kor08) is the most popular. However, it requires a considerable amount of interaction history before it can provide high-quality recommendations. It also suffers drastically from the item cold-start problem, handling which is crucial for news recommendation.

Another common approach is content-based recommendation, which recommends items based on the similarity between user and item features/profiles. Although it can handle item cold-start, it suffers from the problem of over-specialization. Moreover, neither CF nor content-based methods can directly adapt to temporal changes in users' interests.

In general, a news recommender should handle item cold-start very well due to the overwhelming number of articles published each day. It should also be able to adapt to the temporal changes in users' interests. In the case of news, the content of the news articles and the preferences of a user act as the most important signals for recommendation. In order to exploit both, we come up with a hybrid approach for recommendation.

Our model consists of two components. For the first component, we utilize the sequence in which the articles were read by the user to build a user profile. We do this as follows:

1. First, we learn doc2vec (Le14) embeddings for each news article by combining the title and text of the article.

2. We then choose a specific amount of reading history for all the users.

3. Finally, we combine the doc2vec embeddings of the articles present in the user history using certain heuristics which preserve the temporal information encoded in the sequence of articles read by the user.
The second component then captures the similarity between the user profile and the candidate articles by first computing an element-wise product between their representations, followed by fully connected hidden layers. Finally, the output of a logistic unit is used to make predictions. We pose the problem of news recommendation as one of binary classification in order to learn the parameters of the model, relying only on the implicit feedback provided by the user. The first component enables us to understand user preferences and model the temporal changes in their interests, giving us the advantages of a content-based recommender system, while the second component models the user-item interaction in a manner similar to matrix factorization, giving us the advantages of a collaborative filtering based recommender system.

To summarize, the contributions of this work are as follows:

1. We use doc2vec embeddings of each news article to build a profile for each user which encapsulates information about the changing interests of the user over time.

2. We use a deep neural architecture for news recommendation in which we utilize the user-item interaction as well as the content of the news.

3. We pose the problem of recommendation as one of binary classification in order to learn the parameters of the model using only the implicit feedback provided by the users.

4. We perform experiments to show the effectiveness of our model for the problem of news recommendation.

2 Related Work

There has been a lot of work on recommender systems, with a myriad of publications. In this section we review work that is closely associated with ours.

Collaborative Filtering. Collaborative Filtering is an approach of making automatic predictions (filtering) about the interests of a user by collecting interests from many related users. Some of the best results are obtained with matrix factorization techniques (Kor09). Collaborative Filtering methods are usually adopted when the historical records for training are scarce.

Content-based Filtering. Content-based recommender systems try to recommend items similar to those a given user has liked in the past (Lop11)(Sai14)(Kum17). The common approach is to represent both the users and the items in the same feature space; similarity scores can then be computed between users and items, and the recommendation is made based on the similarity scores of a user towards all the items. Content-based filtering methods usually perform well when users have plenty of historical records for learning.

Hybrid of CF and Content-based Filtering. As a first attempt to unify collaborative and content-based filtering, Basilico and Hofmann proposed to learn a kernel or similarity function between user-item pairs that allows simultaneous generalization across either the user or the item dimension (Bas04). This approach does well when the user-item rating matrix is dense. However, in most current recommender-system settings the data is rather sparse, which makes this method fail.

Neural Network based approaches. Early pioneering work using neural networks was done in (Sal07), where a two-layer Restricted Boltzmann Machine (RBM) is used to model users' explicit ratings on items. Recently, autoencoders have become a popular choice for building recommendation systems (Che12)(Sed15)(Str15). In terms of user personalization, this approach shares a similar spirit with the item-item models (Nin11)(Sar01)(Kum17) that represent a user using features related to her rated items. While previous work has lent support for addressing collaborative filtering, most of it has focused on observed ratings and modeled observed data only. As a result, such models can easily fail to learn users' preferences accurately from positive-only implicit data. Moreover, all these models are based on either user-user or item-item interaction, whereas our method is based on user-item interaction. Hence, we leave out comparison with such methods, as there might be differences caused by user personalization.

Implicit Feedback. Implicit feedback originated in the area of information retrieval, and the related techniques have been successfully applied in the domain of recommender systems (Kel03)(Oar98). Implicit feedback is usually inferred from user behaviours, such as browsing items, marking items as favourite, etc. Intuitively, the implicit feedback approach is based on the assumption that implicit feedback can be used to regularize or supplement the explicit training data.

3 Dataset

For this work we use the dataset published by CLEF NewsREEL 2017. CLEF NewsREEL provides an interaction platform to compare the performance of different news recommender systems in online as well as offline settings (Hop16). As part of their evaluation for the offline setting, CLEF shared a dataset which captures interactions between users and news stories. It includes interactions from eight different publishing sites during the month of February 2016. The recorded stream of events includes 2 million notifications, 58 thousand item updates, and 168 million recommendation requests. The dataset also provides other information such as the title and text of each news article, time of publication, etc. Each user can be identified by a unique id. For our task, we needed to find the sequence in which the articles were read by each user, along with their content. Since we rely on implicit feedback, we only need to know whether an article was read by a user or not.

4 Model Architecture

In this section we briefly describe our model. We first discuss user profiling, followed by the neural network architecture. We then provide the training criteria for our model.

[Figure 1: Model Architecture]

4.1 User Profiling

The overview of this can be seen in Figure 1(A). We first define a set of notations useful in understanding the creation of the user profile. We define the number of articles in the user reading history to be R. The doc2vec embedding of each article in the history is represented by r_h, where 1 ≤ h ≤ R. Each vector is of size 300. The user profile for a user is denoted by U. We now discuss three kinds of operations with which we create the user profiles.

1. Centroid

In this method, we find the centroid of the embeddings of the articles present in the reading history of the user. The centroid then represents the user profile.

    U = (1/R) Σ_{h=1}^{R} r_h    (1)

2. Discounting

Here we first discount each vector in the user reading history by a power of 2, such that an article read at time t−1 carries half the weight of an article read at time t. We then take the average of all the vectors.

    U = (1/R) Σ_{h=1}^{R} r_h / 2^{R−h}    (2)

3. Exponential Discounting

Here we discount each vector in the user reading history by a power of e, such that an article read at time t−1 carries 1/e of the weight of an article read at time t. We then take the average of all the vectors.

    U = (1/R) Σ_{h=1}^{R} r_h / e^{R−h}    (3)

Using such a method allows us to understand the preferences of the user based on the content of the articles read. It also helps us to understand the temporal changes in users' interests.

4.2 Neural Network Architecture

After the user profile is obtained, we perform an element-wise product between the profile and the embedding of the candidate article, as can be seen in Figure 1(B). These candidate articles are basically the positive and negative samples used for training the model. We then feed the element-wise product as input to a hidden layer of size 128, followed by two subsequent fully connected hidden layers of sizes 64 and 32.
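The three profiling operations of Eqs. (1)–(3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: toy 2-dimensional vectors stand in for the 300-dimensional doc2vec embeddings, and the helper names are our own.

```python
import math

def centroid(history):
    """Eq. (1): plain average of the R article vectors (oldest first)."""
    R, dim = len(history), len(history[0])
    return [sum(v[d] for v in history) / R for d in range(dim)]

def discounted(history, base=2.0):
    """Eqs. (2)-(3): article h (1-indexed) is scaled by base^(R-h), so the
    most recent article keeps full weight and each step back divides the
    weight by `base` (2 for Discounting, e for Exponential Discounting)."""
    R, dim = len(history), len(history[0])
    return [sum(history[h][d] / base ** (R - 1 - h) for h in range(R)) / R
            for d in range(dim)]

# Toy reading history of three 2-d "embeddings", oldest first.
history = [[4.0, 0.0], [2.0, 0.0], [1.0, 2.0]]
print(centroid(history))              # ≈ [2.333, 0.667]
print(discounted(history))            # power-of-2 discounting
print(discounted(history, math.e))    # exponential discounting
```

Note how the discounting variants let recent reads dominate the profile, which is what encodes the temporal drift of a user's interests.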
Finally, we use a logistic unit to make predictions. A careful reader might have noticed that such an architecture gives us the capability to learn an arbitrary similarity function, instead of traditional metrics such as cosine similarity which have normally been used for calculating relevance. Typically, in matrix factorization, predictions are made by computing a dot product between the user and item representations, i.e. u^T q, where u is the user representation and q is the item representation. In our case, however, we compute a_out(h^T (φ(u) ⊙ φ(i))), where a_out and h represent the activation function (the logistic function) and the edge weights of the output layer, and φ(u), φ(i) represent the non-linear transformations of the user and the item respectively. An astute reader might notice that if we use an identity function for a_out and enforce h to be a uniform vector of 1s, we recover the matrix factorization model. Hence, using such an architecture helps us to retain the advantages of collaborative filtering for news recommendation.

[Figure 2: Performance of our model vs some state-of-the-art models (HR@K and NDCG@K for K = 1..10; baselines: Key-VSM, U2U, I2I, Word, SVD, ItemPop)]

Table 1: Performance with different user profiles

    K    Avg (HR / NDCG)    Discounting (HR / NDCG)    Exponential (HR / NDCG)
    1    0.319 / 0.319      0.258 / 0.258              0.237 / 0.237
    2    0.478 / 0.419      0.404 / 0.350              0.384 / 0.330
    3    0.568 / 0.464      0.506 / 0.4013             0.483 / 0.379
    4    0.624 / 0.489      0.573 / 0.430              0.550 / 0.408
    5    0.664 / 0.504      0.619 / 0.447              0.595 / 0.426
    6    0.696 / 0.515      0.654 / 0.460              0.631 / 0.439
    7    0.718 / 0.522      0.678 / 0.468              0.658 / 0.448
    8    0.737 / 0.529      0.696 / 0.474              0.680 / 0.454
    9    0.754 / 0.533      0.712 / 0.478              0.697 / 0.460
    10   0.768 / 0.538      0.724 / 0.482              0.713 / 0.464

4.3 Training

Since we only utilize the implicit feedback of users, we pose the problem of recommendation as binary classification, where label 1 means highly recommended and 0 means not recommended. We use the binary cross-entropy loss, also known as log loss, to learn the parameters of the model.

5 Experiments

As mentioned earlier, we use the data provided by CLEF NewsREEL 2017. We choose users who have read between 8 and 15 (inclusive) articles for training and testing our model. The frequency of users who have read more than 15 articles varies extensively, and hence we restrict ourselves to an upper bound of 15. We set the lower bound to 8 since we need some history in order to capture the changing user interests. In future work, we would like to investigate how changing the lower bound affects the performance of our model.

Evaluation Protocol: For each user we held out her latest interaction as the test set and utilized the remaining data for training. We then recommend a ranked list of articles to each user. The performance of a ranked list is judged by Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG). Unless mentioned otherwise, we truncate the ranked list at 10 for both metrics.

Baselines: We compare our method with several others. First, we look at an item-popularity based method (ItemPop), which recommends the most popular items to the user. We then evaluate User-to-User (U2U-KNN) and Item-to-Item (I2I-KNN), setting the neighbourhood size to 80. We also compare with Singular Value Decomposition (SVD). Finally, we implement Word Embedding based Recommendations as in (Mus16) and the Keyword based Vector Space Model (Key-VSM) as mentioned in (Lop11).
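The prediction rule of Section 4.2, and its reduction to matrix factorization, can be illustrated with a small pure-Python sketch. This is a shape-level illustration under toy values, not the trained model: the learned MLP transformations φ are omitted, and the vectors and weights are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(u, q, h, a_out=sigmoid):
    # a_out(h^T (u ⊙ q)): element-wise product of the user profile u and
    # the candidate-article embedding q, weighted by output-layer weights h.
    z = sum(hw * uu * qq for hw, uu, qq in zip(h, u, q))
    return a_out(z)

u = [0.2, -0.5, 0.7]   # toy user profile
q = [0.1, 0.4, 0.3]    # toy candidate-article embedding
h = [1.0, 1.0, 1.0]    # uniform output weights

# With the identity activation and h fixed to a vector of 1s, the score is
# exactly the dot product u·q — the matrix factorization special case.
dot = sum(a * b for a, b in zip(u, q))
assert abs(score(u, q, h, a_out=lambda x: x) - dot) < 1e-12

# With the logistic unit, the same interaction yields a probability in (0, 1),
# matching the binary-classification training objective of Section 4.3.
print(score(u, q, h))
```

Replacing the fixed dot product with learned weights and non-linearities is precisely what lets the network fit an arbitrary similarity function.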
Parameter Settings: We implemented our proposed model using Keras (Cho15). We construct our training set as follows:

1. We first define the reading history, which we denote by h.

2. Leaving out the latest article read by each user, the remaining articles are used as positive samples.

3. Corresponding to each positive sample, we randomly sample 4 negative instances (articles which the user did not read).

We then randomly divide the training set into training and validation sets in a 4:1 ratio, ensuring that the two sets do not overlap. We tuned the hyper-parameters of our model using the validation set. We use a batch size of 256.

6 Results

From Figure 2 we can see the results of our model compared with the baselines. Our model outperforms the baselines by a significant margin in terms of both HR and NDCG across all positions. This clearly shows the effectiveness of our model in understanding user preferences and making predictions accordingly. Further, it can be clearly noticed that U2U, I2I and SVD do not perform well. One reason for this could be the sparsity of the data: in the presence of sparse data, these methods fail to capture relevant information. The low performance of Word Embedding based Recommendations suggests that a representation of words alone is not effective in profiling the user. The model also outperforms Key-VSM (Lop11), which suggests the effectiveness of the user profile component used in our model.

In Table 1, we compare the results obtained using the different profiling methods. The trend in performance is: Avg > Discounting > Exponential. This suggests that all the articles read by the user in a particular window have some importance in predicting the article that the user will read next.

Further, we experiment with the size of the reading history used as input to our model; the results are depicted in Figure 3. We see that a size of 12 performs best when using the averaging method for profiling, while for the other two a size of 8 performs best. We also experiment with the number of negative samples used for training the model parameters. From Figure 4, we can see that increasing the number of negative samples improves the performance of the model, but only up to a certain point, after which performance deteriorates.

[Figure 3: Performance of our model w.r.t. reading history of user (HR@10 and NDCG@10 for history sizes 8-14)]

[Figure 4: Performance of our model w.r.t. number of negative samples (HR@10 and NDCG@10 for 0-6 negatives)]

We also evaluate the model on item cold-start and find that it achieves an HR@10 score of around 0.32. While typical collaborative filtering models would fail in this setting, using content vectors for articles gives our model the flexibility to account for these cases as well.

7 Conclusion and Future Work

In this work, we come up with a neural model for content-collaborative filtering for news recommendation which incorporates both the user-item interaction pattern and the content of the news articles read by the user in the past. In future, we would like to explore deep recurrent models for user profiling.

Acknowledgement

We thank Kartik Gupta of the Data Science and Analytics Centre at International Institute of Information Technology Hyderabad for helping us in making a presentation of this work.

References

[Bas04] Basilico, Justin, and Thomas Hofmann. "Unifying collaborative and content-based filtering." Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.

[Bel07] Bell, Robert M., and Yehuda Koren. "Improved neighborhood-based collaborative filtering." KDD Cup and Workshop at the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007.

[Che12] Chen, Minmin, et al. "Marginalized denoising autoencoders for domain adaptation." arXiv preprint arXiv:1206.4683 (2012).

[Cho15] Chollet, François. "Keras." (2015).

[Hop16] Hopfgartner, Frank, et al. "Benchmarking news recommendations: The CLEF NewsREEL use case." ACM SIGIR Forum. Vol. 49. No. 2. ACM, 2016.

[Kel03] Kelly, Diane, and Jaime Teevan. "Implicit feedback for inferring user preference: a bibliography." ACM SIGIR Forum. Vol. 37. No. 2. ACM, 2003.

[Kha17] Khattar, Dhruv, Vaibhav Kumar, and Vasudeva Varma. "Leveraging Moderate User Data for News Recommendation." Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, 2017.

[Kor08] Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008.

[Kor09] Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009).

[Kum17] Kumar, Vaibhav, et al. "Word Semantics based 3-D Convolutional Neural Networks for News Recommendation." Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, 2017.

[Kum17] Kumar, Vaibhav, et al. "Deep Neural Architecture for News Recommendation." Working Notes of the 8th International Conference of the CLEF Initiative, Dublin, Ireland. CEUR Workshop Proceedings. 2017.

[Kum17] Kumar, Vaibhav, et al. "User Profiling Based Deep Neural Network for Temporal News Recommendation." Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, 2017.

[Le14] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. 2014.

[Lop11] Lops, Pasquale, Marco de Gemmis, and Giovanni Semeraro. "Content-based recommender systems: State of the art and trends." Recommender Systems Handbook. Springer, Boston, MA, 2011. 73-105.

[Mus16] Musto, Cataldo, et al. "Learning word embeddings from Wikipedia for content-based recommender systems." European Conference on Information Retrieval. Springer, Cham, 2016.

[Nin11] Ning, Xia, and George Karypis. "SLIM: Sparse linear methods for top-N recommender systems." Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011.

[Oar98] Oard, Douglas W., and Jinmook Kim. "Implicit feedback for recommender systems." Proceedings of the AAAI Workshop on Recommender Systems. Wollongong, 1998.

[Ren05] Rennie, Jasson D. M., and Nathan Srebro. "Fast maximum margin matrix factorization for collaborative prediction." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.

[Sai14] Saia, Roberto, Ludovico Boratto, and Salvatore Carta. "Semantic Coherence-based User Profile Modeling in the Recommender Systems Context." KDIR. 2014.

[Sal07] Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

[Sar01] Sarwar, Badrul, et al. "Item-based collaborative filtering recommendation algorithms." Proceedings of the 10th International Conference on World Wide Web. ACM, 2001.

[Sed15] Sedhain, Suvash, et al. "AutoRec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

[Str15] Strub, Florian, and Jérémie Mary. "Collaborative filtering with stacked denoising autoencoders and sparse inputs." NIPS Workshop on Machine Learning for eCommerce. 2015.