         Neural Content-Collaborative Filtering for News
                       Recommendation

                 Dhruv Khattar, Vaibhav Kumar∗, Manish Gupta†, Vasudeva Varma
                         Information Retrieval and Extraction Laboratory
                   International Institute of Information Technology Hyderabad
{dhruv.khattar, vaibhav.kumar}@research.iiit.ac.in, {manish.gupta, vv}@iiit.ac.in



∗Author had equal contribution.
†The author is also an applied researcher at Microsoft.

Copyright © 2018 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: D. Albakour, D. Corney, J. Gonzalo, M. Martinez, B. Poblete, A. Vlachos (eds.): Proceedings of the NewsIR'18 Workshop at ECIR, Grenoble, France, 26-March-2018, published at http://ceur-ws.org

Abstract

Popular methods like collaborative filtering and content-based filtering have their own disadvantages: the former requires a considerable amount of user data before making predictions, while the latter suffers from over-specialization. In this work, we address both of these issues with a hybrid, neural network based approach to news recommendation. The hybrid approach incorporates both (1) the user-item interaction and (2) the content of the articles read by the user in the past. We first build an article-embedding based profile for the user. We then use this user profile, together with adequate positive and negative samples, to train the neural network based model. The resulting model is applied to a real-world dataset and compared with a set of established baselines; the experimental results show that our model outperforms the state-of-the-art.

1 Introduction

A popular approach to the task of recommendation is collaborative filtering (CF) (Bel07)(Ren05)(Sal07), which uses the user's past interactions with items to predict the most relevant content. Amongst the various approaches to collaborative filtering, matrix factorization (MF) (Kor08) is the most popular one. However, it requires a considerable amount of interaction history before it can provide high-quality recommendations, and it suffers drastically from the item cold-start problem, the handling of which is crucial for news recommendation.

Another common approach is content-based recommendation, which recommends items based on the similarity between user and item features/profiles. Although it can handle item cold-start, it suffers from the problem of over-specialization. Moreover, neither CF nor content-based filtering can directly adapt to temporal changes in users' interests.

In general, a news recommender should handle item cold-start very well, owing to the overwhelming number of articles published each day, and it should be able to adapt to the temporal changes in users' interests. In the case of news, the content of the article and the preferences of the user are the most important signals for recommendation. To exploit both, we come up with a hybrid approach for recommendation.

Our model consists of two components. For the first component, we utilize the sequence in which articles were read by the user to build a user profile. We do this as follows (a sketch of step 1 appears after the list):

1. First, we learn doc2vec (Le14) embeddings for each news article by combining the title and text of the article.

2. We then choose a specific amount of reading history for all the users.

3. Finally, we combine the doc2vec embeddings of the articles present in the user history using heuristics which preserve the temporal information encoded in the sequence of articles read by the user.
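To make step 1 concrete, the sketch below shows how such 300-dimensional article embeddings could be learned with gensim's Doc2Vec (assuming gensim 4.x); the article fields and all training parameters other than the vector size are illustrative assumptions, not details taken from the paper.

# A minimal sketch of step 1: learning 300-dimensional doc2vec embeddings
# for news articles from their title and text. Assumes gensim 4.x; the
# sample data and hyper-parameters (except the vector size) are assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

articles = [  # hypothetical articles; real ones come from the NewsREEL data
    {"id": "a1", "title": "Sample headline", "text": "Body of the story."},
    {"id": "a2", "title": "Another headline", "text": "More body text."},
]

# Combine the title and text of each article into one tagged document.
corpus = [
    TaggedDocument(words=(a["title"] + " " + a["text"]).lower().split(),
                   tags=[a["id"]])
    for a in articles
]

model = Doc2Vec(vector_size=300, min_count=1, epochs=20)  # 300 dims, as in Sec. 4.1
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

r_h = model.dv["a1"]  # the embedding r_h of one article in a user's history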
The second component then captures the similarity between the user profile and the candidate articles by first computing an element-wise product between their representations, followed by fully connected hidden layers. Finally, the output of a logistic unit is used to make predictions. We pose the problem of news recommendation as one of binary classification in order to learn the parameters of the model, relying only on the implicit feedback provided by the user. The first component enables us to understand user preferences and model the temporal changes in their interests, giving us the advantages of a content-based recommendation system, while the second component models the user-item interaction in a manner similar to matrix factorization, giving us the advantages of a collaborative filtering based recommender system.

To summarize, the contributions of this work are as follows:

1. We use doc2vec embeddings of each news article to build a profile for each user which encapsulates information about the changing interests of the user over time.

2. We use a deep neural architecture for news recommendation which utilizes the user-item interaction as well as the content of the news.

3. We pose the problem of recommendation as one of binary classification in order to learn the parameters of the model using only the implicit feedback provided by the users.

4. We perform experiments to show the effectiveness of our model for the problem of news recommendation.

2 Related Work

There has been a lot of work on recommender systems, with a myriad of publications. In this section we review work that is closely related to ours.

Collaborative Filtering. Collaborative Filtering is an approach for making automatic predictions (filtering) about the interests of a user by collecting interests from many related users. Some of the best results are obtained with matrix factorization techniques (Kor09). Collaborative Filtering methods are usually adopted when the historical records for training are scarce.

Content-based Filtering. Content-based recommender systems try to recommend items similar to those a given user has liked in the past (Lop11)(Sai14)(Kum17). The common approach is to represent both the users and the items in the same feature space, compute similarity scores between users and items, and then recommend to a user based on her similarity scores towards all the items. Content-based filtering methods usually perform well when users have plenty of historical records to learn from.

Hybrids of CF and Content-based Filtering. As a first attempt to unify collaborative and content-based filtering, Basilico and Hofmann (Bas04) proposed to learn a kernel or similarity function between user-item pairs that allows simultaneous generalization across either the user or the item dimension. This approach does well when the user-item rating matrix is dense; however, in most current recommender-system settings the data is rather sparse, which makes the method fail.

Neural Network based Approaches. Early pioneering work using neural networks was done in (Sal07), where a two-layer Restricted Boltzmann Machine (RBM) is used to model users' explicit ratings on items. Recently, autoencoders have become a popular choice for building recommendation systems (Che12)(Sed15)(Str15). In terms of user personalization, these approaches share a similar spirit with item-item models (Nin11)(Sar01)(Kum17), which represent a user using features related to her rated items. While previous work has lent support for addressing collaborative filtering, most of it has focused on observed ratings and modeled observed data only. As a result, such models can easily fail to learn users' preferences accurately from positive-only implicit data. Moreover, these models are based on either user-user or item-item interaction, whereas our method is based on user-item interaction; hence we leave out comparison with such methods, as differences could be caused by user personalization.

Implicit Feedback. Implicit feedback originated in the area of information retrieval, and the related techniques have been successfully applied in the domain of recommender systems (Kel03)(Oar98). Implicit feedback is usually inferred from user behaviour, such as browsing items or marking items as favourites. Intuitively, the implicit feedback approach is based on the assumption that implicit feedback can be used to regularize or supplement explicit training data.

3 Dataset

For this work we use the dataset published by CLEF NewsREEL 2017. CLEF NewsREEL provides an interaction platform to compare the performance of different news recommender systems in both online and offline settings (Hop16). As part of their evaluation for the offline setting, CLEF shared a dataset which captures interactions between users and news stories. It includes interactions on eight different publishing sites in the month of February 2016. The recorded stream of events includes 2 million notifications, 58 thousand item updates, and 168 million recommendation requests. The dataset also provides other information such as the title and text of each news article, time of publication, etc. Each user can be identified by a unique id. For our task, we needed the sequence in which articles were read by each user along with their content. Since we rely on implicit feedback, we only need to know whether an article was read by a user or not.
Figure 1: Model Architecture

4 Model Architecture

In this section we briefly describe our model. We first discuss user profiling, followed by the neural network architecture, and then the training criterion for our model.

4.1 User Profiling

An overview of this component is shown in Figure 1(A). We first define some notation useful for understanding the creation of the user profile. Let R be the number of articles in the user's reading history. The doc2vec embedding of the h-th article in the history is denoted r_h, where 1 ≤ h ≤ R; each vector is of size 300. The user profile is denoted U. We now discuss three operations with which we create user profiles.

1. Centroid. In this method, we take the centroid of the embeddings of the articles present in the reading history of the user; the centroid then represents the user profile:

   U = \frac{1}{R} \sum_{h=1}^{R} r_h    (1)

2. Discounting. Here we first discount each vector in the reading history by a power of 2, such that an article read at time t−1 carries half the weight of an article read at time t, and then average all the vectors:

   U = \frac{1}{R} \sum_{h=1}^{R} \frac{r_h}{2^{R-h}}    (2)

3. Exponential Discounting. Here we discount each vector in the reading history by a power of e, such that an article read at time t−1 carries 1/e of the weight of an article read at time t, and then average all the vectors:

   U = \frac{1}{R} \sum_{h=1}^{R} \frac{r_h}{e^{R-h}}    (3)

Such a method allows us to understand the preferences of the user based on the content of the articles she has read; it also helps us capture the temporal changes in her interests. A NumPy sketch of these three operations follows.
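As a concrete illustration of Equations (1)-(3), the following sketch computes all three profile variants from a stacked history of doc2vec vectors; the array layout (one row per article, oldest first) is an assumption of this sketch.

import numpy as np

def user_profile(history, mode="centroid"):
    # history: array of shape (R, 300); row h-1 holds r_h, oldest article first.
    R = history.shape[0]
    h = np.arange(1, R + 1)
    if mode == "centroid":            # Eq. (1): plain average
        weights = np.ones(R)
    elif mode == "discounting":       # Eq. (2): divide r_h by 2^(R-h)
        weights = 1.0 / np.power(2.0, R - h)
    elif mode == "exponential":       # Eq. (3): divide r_h by e^(R-h)
        weights = np.exp(-(R - h).astype(float))
    else:
        raise ValueError("unknown profiling mode: " + mode)
    return (weights[:, None] * history).sum(axis=0) / R  # the profile U

U = user_profile(np.random.rand(8, 300), mode="discounting")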
4.2 Neural Network Architecture

After the user profile is obtained, we perform an element-wise product between the profile and the embedding of the candidate article, as shown in Figure 1(B); these candidate articles are the positive and negative samples used for training the model. We feed the element-wise product as input to a hidden layer of size 128, followed by two further fully connected hidden layers of sizes 64 and 32. Finally, we use a logistic unit to make predictions. A careful reader may notice that such an architecture gives us the ability to learn an arbitrary similarity function, instead of traditional metrics such as cosine similarity which have normally been used for computing relevance. Typically, in matrix factorization, predictions are made by taking a dot product between the user and item representations, i.e. u^T q, where u is the user representation and q is the item representation. In our case, however, we compute a_out(h^T(φ(u) ⊙ φ(i))), where a_out and h denote the activation function (the logistic function) and the edge weights of the output layer, and φ(u), φ(i) denote non-linear transformations of the user and the item respectively. An astute reader may notice that if we use the identity function for a_out and force h to be a uniform vector of ones, we recover the matrix factorization model. Hence, such an architecture helps us retain the advantages of collaborative filtering for news recommendation.

4.3 Training

Since we only utilize the implicit feedback of users, we pose the problem of recommendation as one of binary classification, where label 1 means highly recommended and 0 means not recommended. We use the binary cross-entropy loss, also known as log loss, to learn the parameters of the model.
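A minimal Keras sketch of this network follows. The layer sizes (128, 64, 32), the element-wise product, the logistic output, and the binary cross-entropy loss come from the text above; the hidden-layer activations and the optimizer are assumptions of the sketch.

# Minimal sketch of the scoring network of Fig. 1(B): element-wise product
# of the user profile and the candidate-article embedding, three dense
# layers, and a logistic unit trained with binary cross entropy.
from keras.layers import Input, Multiply, Dense
from keras.models import Model

user = Input(shape=(300,), name="user_profile")        # the profile U
item = Input(shape=(300,), name="candidate_article")   # doc2vec of candidate

x = Multiply()([user, item])                 # element-wise product
x = Dense(128, activation="relu")(x)         # hidden layers; ReLU is assumed
x = Dense(64, activation="relu")(x)
x = Dense(32, activation="relu")(x)
score = Dense(1, activation="sigmoid")(x)    # logistic unit for prediction

model = Model(inputs=[user, item], outputs=score)
model.compile(optimizer="adam",              # optimizer choice is an assumption
              loss="binary_crossentropy")    # log loss, as in Sec. 4.3

Replacing the sigmoid by the identity and freezing the output weights to ones would recover the matrix-factorization special case noted above.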
                                                           (U2U-KNN) and Item-to-Item (I2I-KNN) by setting
                                                           the neighbourhood size to 80. We then compare it with
5     Experiments                                          Singular Value Decomposition (SVD). We also imple-
As mentioned earlier we use the data provided by           ment Word Embeddings based Recommendations as
CLEF NewsReel 2017. We choose users who have               in (Mus16) and Keyword based Vector Space Model
read in between 8-15 (inclusive) articles for training     (Key-VSM) as mentioned in (Lop11).
and testing our model for item recommendation. The            Parameter Settings: We implemented our pro-
frequency of users who have read more than 15 arti-        posed model using Keras (Cho15). We then construct
cles varies extensively and hence we restrict ourselves    our training set as follows:
to the upper bound of 15. We set the lower bound
to 8 since we need some history in order to capture               1. We first define the reading history. We denote the
the changing user interests. However, for future work                reading history by h.
we would like to investigate how changing the lower
                                                                  2. Leaving the latest article read by each user, the
bound affects the performance of our model.
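Since each test set contains a single held-out article, HR@K and NDCG@K reduce to the simple forms sketched below; this is our reading of the protocol, not code from the paper.

import math

def hr_at_k(ranked, held_out, k=10):
    # Hit Ratio: 1 if the held-out article appears in the top-k list, else 0.
    return int(held_out in ranked[:k])

def ndcg_at_k(ranked, held_out, k=10):
    # With a single relevant item the ideal DCG is 1, so NDCG is
    # 1/log2(rank+1) when the item is ranked within the top k, else 0.
    for rank, article in enumerate(ranked[:k], start=1):
        if article == held_out:
            return 1.0 / math.log2(rank + 1)
    return 0.0

assert hr_at_k(["a3", "a1", "a2"], "a1") == 1
assert abs(ndcg_at_k(["a3", "a1", "a2"], "a1") - 1.0 / math.log2(3)) < 1e-9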
Baselines: We compare our method with several others. First, we consider an item-popularity based method (ItemPop), which recommends the most popular items to every user. We also evaluate User-to-User (U2U-KNN) and Item-to-Item (I2I-KNN) with the neighbourhood size set to 80, as well as Singular Value Decomposition (SVD). Finally, we implement Word Embedding based Recommendation as in (Mus16) and the Keyword based Vector Space Model (Key-VSM) of (Lop11).

Parameter Settings: We implemented our proposed model using Keras (Cho15). We construct our training set as follows (a sketch of this sampling scheme appears after the list):

1. We first define the reading history, denoted by h.

2. Leaving out the latest article read by each user, the remaining articles are used as positive samples.

3. Corresponding to each positive sample, we randomly sample 4 negative instances (articles which the user did not read).

We then randomly divide the training samples into training and validation sets in a 4:1 ratio, ensuring that the two sets do not overlap. We tuned the hyper-parameters of our model using the validation set, and use a batch size of 256.
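A sketch of this construction for a single user is given below; the data structures and the random seed are illustrative assumptions.

import random

def training_samples(history, all_articles, n_neg=4):
    # history: a user's chronologically ordered article ids. The latest
    # article is held out for testing; the rest become positives (label 1).
    positives = history[:-1]
    unread = list(set(all_articles) - set(history))
    samples = []
    for article in positives:
        samples.append((article, 1))
        # 4 negative instances per positive: articles the user did not read.
        for neg in random.sample(unread, n_neg):
            samples.append((neg, 0))
    return samples

random.seed(0)
samples = training_samples(["a1", "a2", "a3"], ["a%d" % i for i in range(50)])
random.shuffle(samples)
cut = int(len(samples) * 4 / 5)          # 4:1 training/validation split
train, valid = samples[:cut], samples[cut:]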
6 Results

Figure 2: Performance of our model vs. some state-of-the-art models (HR@K and NDCG@K for K = 1 to 10; baselines: Key-VSM, U2U, I2I, Word, SVD, ItemPop)

From Figure 2 we can see the results of our model compared with the baselines. Our model outperforms the baselines by a significant margin in terms of both HR and NDCG across all positions. This clearly shows the effectiveness of our model in understanding user preferences and making predictions accordingly. Further, it can clearly be noticed that U2U, I2I and SVD do not perform well; one reason for this could be the sparsity of the data, in whose presence these methods fail to capture relevant information. The low performance of Word Embedding based Recommendation suggests that a representation of words alone is not effective for profiling the user. The model also outperforms Key-VSM (Lop11), which suggests the effectiveness of the user-profile component of our model.

In Table 1, we compare the results obtained using the different profiling methods. The trend in performance is Avg > Discounting > Exponential. This suggests that all the articles read by the user in a particular window have some importance in predicting the article that the user will read next.

Table 1: Performance with different user profiles

 K     Avg             Discounting     Exponential
       HR     NDCG     HR     NDCG     HR     NDCG
 1     0.319  0.319    0.258  0.258    0.237  0.237
 2     0.478  0.419    0.404  0.350    0.384  0.330
 3     0.568  0.464    0.506  0.401    0.483  0.379
 4     0.624  0.489    0.573  0.430    0.550  0.408
 5     0.664  0.504    0.619  0.447    0.595  0.426
 6     0.696  0.515    0.654  0.460    0.631  0.439
 7     0.718  0.522    0.678  0.468    0.658  0.448
 8     0.737  0.529    0.696  0.474    0.680  0.454
 9     0.754  0.533    0.712  0.478    0.697  0.460
10     0.768  0.538    0.724  0.482    0.713  0.464

Further, we experiment with the size of the reading history used as input to our model; the results are depicted in Figure 3. We see that a history size of 12 performs best when using the averaging method for profiling, while for the other two methods a size of 8 performs best.

Figure 3: Performance of our model w.r.t. the reading history of the user (HR@10 and NDCG@10 for the Average, Discounting, and Exponential Discounting profiles)

We also experiment with the number of negative samples used for training the model parameters. From Figure 4 we can see that increasing the number of negative samples improves the performance of the model, but only up to a certain point, after which the performance deteriorates.

Figure 4: Performance of our model w.r.t. the number of negative samples (HR@10 and NDCG@10 for the Average, Discounting, and Exponential Discounting profiles)

We also evaluate the model on item cold-start and find that it achieves an HR@10 score of around 0.32. While typical collaborative filtering models would fail in this setting, using content vectors for articles gives our model the flexibility to handle these cases as well.

7 Conclusion and Future Work

In this work, we proposed a neural model for content-collaborative filtering for news recommendation which incorporates both the user-item interaction pattern and the content of the news articles read by the user in the past. In the future, we would like to explore deep recurrent models for user profiling.

Acknowledgement

We thank Kartik Gupta of the Data Science and Analytics Centre at the International Institute of Information Technology Hyderabad for helping us in making a presentation of this work.

References

[Bas04] Basilico, Justin, and Thomas Hofmann. "Unifying collaborative and content-based filtering." Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.

[Bel07] Bell, Robert M., and Yehuda Koren. "Improved neighborhood-based collaborative filtering." KDD Cup and Workshop at the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2007.

[Che12] Chen, Minmin, et al. "Marginalized denoising autoencoders for domain adaptation." arXiv preprint arXiv:1206.4683 (2012).

[Cho15] Chollet, François. "Keras." (2015).

[Hop16] Hopfgartner, Frank, et al. "Benchmarking news recommendations: The CLEF NewsREEL use case." ACM SIGIR Forum. Vol. 49. No. 2. ACM, 2016.
[Kel03] Kelly, Diane, and Jaime Teevan. "Implicit feedback for inferring user preference: a bibliography." ACM SIGIR Forum. Vol. 37. No. 2. ACM, 2003.

[Kha17] Khattar, Dhruv, Vaibhav Kumar, and Vasudeva Varma. "Leveraging moderate user data for news recommendation." Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, 2017.

[Kor08] Koren, Yehuda. "Factorization meets the neighborhood: a multifaceted collaborative filtering model." Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008.

[Kor09] Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009).

[Kum17] Kumar, Vaibhav, et al. "Word semantics based 3-D convolutional neural networks for news recommendation." 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2017.

[Kum17] Kumar, Vaibhav, et al. "Deep neural architecture for news recommendation." Working Notes of the 8th International Conference of the CLEF Initiative, Dublin, Ireland. CEUR Workshop Proceedings, 2017.

[Kum17] Kumar, Vaibhav, et al. "User profiling based deep neural network for temporal news recommendation." Data Mining Workshops (ICDMW), 2017 IEEE International Conference on. IEEE, 2017.

[Le14] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. 2014.

[Lop11] Lops, Pasquale, Marco De Gemmis, and Giovanni Semeraro. "Content-based recommender systems: State of the art and trends." Recommender Systems Handbook. Springer, Boston, MA, 2011. 73-105.

[Mus16] Musto, Cataldo, et al. "Learning word embeddings from Wikipedia for content-based recommender systems." European Conference on Information Retrieval. Springer, Cham, 2016.

[Nin11] Ning, Xia, and George Karypis. "SLIM: Sparse linear methods for top-N recommender systems." Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 2011.

[Oar98] Oard, Douglas W., and Jinmook Kim. "Implicit feedback for recommender systems." Proceedings of the AAAI Workshop on Recommender Systems. Vol. 83. Wollongong, 1998.

[Ren05] Rennie, Jasson D. M., and Nathan Srebro. "Fast maximum margin matrix factorization for collaborative prediction." Proceedings of the 22nd International Conference on Machine Learning. ACM, 2005.

[Sai14] Saia, Roberto, Ludovico Boratto, and Salvatore Carta. "Semantic coherence-based user profile modeling in the recommender systems context." KDIR. 2014.

[Sal07] Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. "Restricted Boltzmann machines for collaborative filtering." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.

[Sar01] Sarwar, Badrul, et al. "Item-based collaborative filtering recommendation algorithms." Proceedings of the 10th International Conference on World Wide Web. ACM, 2001.

[Sed15] Sedhain, Suvash, et al. "AutoRec: Autoencoders meet collaborative filtering." Proceedings of the 24th International Conference on World Wide Web. ACM, 2015.

[Str15] Strub, Florian, and Jeremie Mary. "Collaborative filtering with stacked denoising autoencoders and sparse inputs." NIPS Workshop on Machine Learning for eCommerce. 2015.