=Paper=
{{Paper
|id=Vol-1441/recsys2015_poster23
|storemode=property
|title=Word Embedding Techniques for Content-based Recommender Systems: An Empirical Evaluation
|pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster23.pdf
|volume=Vol-1441
|dblpUrl=https://dblp.org/rec/conf/recsys/MustoSGL15
}}
==Word Embedding Techniques for Content-based Recommender Systems: An Empirical Evaluation==
Cataldo Musto, Giovanni Semeraro, Marco De Gemmis, Pasquale Lops
University of Bari Aldo Moro, Italy, {name.surname}@uniba.it

===ABSTRACT===
This work presents an empirical comparison among three widespread word embedding techniques: Latent Semantic Indexing, Random Indexing, and the more recent Word2Vec. Specifically, we employed these techniques to learn a low-dimensional vector-space word representation and exploited it to represent both items and user profiles in a content-based recommendation scenario. The performance of the techniques has been evaluated on two state-of-the-art datasets, and the experimental results provide good insights which pave the way to several future directions.

===1. MOTIVATIONS AND METHODOLOGY===
Word embedding techniques learn, in a totally unsupervised way, a low-dimensional vector-space representation of words by analyzing their usage in (very) large corpora of textual documents. These approaches have recently been gaining more and more attention, since they have shown very good performance in a broad range of natural language processing scenarios, ranging from sentiment analysis and machine translation to more challenging ones such as learning a textual description of a given image (see http://googleresearch.blogspot.it/2014/11/a-picture-is-worth-thousand-coherent.html).

In a nutshell, all these techniques employ a large corpus of documents to encode the co-occurrences between terms, in order to learn both linguistic regularities and semantic nuances according to their usage. Next, given this huge co-occurrence matrix, each technique uses a different approach to obtain a smaller, low-dimensional representation of each word occurring in the original corpus. An important feature common to all these techniques is that the dimension of the representation (that is, the size of the vectors) is just a parameter of the model, so it can be set according to specific constraints or peculiarities of the data.

However, although the effectiveness of such techniques (especially when combined with deep neural network architectures) is already taken for granted, only a few works have investigated how well they perform in recommender systems-related tasks. To this aim, in this work we defined a very simple content-based recommendation framework based on word embeddings, in order to assess the effectiveness of such techniques in these scenarios as well. Specifically, we first exploited word embedding techniques to represent words in vector spaces. Next, we inferred a vector-space representation of each item by summing the representations of the words occurring in its textual description. Similarly, user profiles are represented by summing the representations of the items the user liked. Finally, by exploiting classic similarity measures, the available items can be ranked according to their descending similarity with respect to the user profile, and recommendations can be provided in a typical Top-N recommendation setting.

Clearly, this is a very basic formulation, since more fine-grained representations can be learned for both items and user profiles. However, this work only aims to preliminarily evaluate the effectiveness of such representations in a simplified recommendation framework, in order to pave the way to several future research directions in the area.
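The whole pipeline can be condensed into a few lines. The following is only a minimal sketch, not the authors' implementation: it assumes the word vectors have already been learned with one of the three techniques, and names such as <code>item_vector</code>, <code>user_profile</code> and <code>top_n</code> are purely illustrative.

<syntaxhighlight lang="python">
# Minimal sketch of the content-based framework described above.
# Assumptions (not taken from the paper's code): word vectors are available
# as a dict {word: np.ndarray}, items are bags of words, profiles are sums.
import numpy as np

def item_vector(description_words, word_vectors, dim):
    """Item representation: sum of the vectors of the words in its description."""
    vec = np.zeros(dim)
    for w in description_words:
        if w in word_vectors:
            vec += word_vectors[w]
    return vec

def user_profile(liked_item_ids, item_vecs, dim):
    """User profile: sum of the vectors of the items the user liked."""
    vec = np.zeros(dim)
    for i in liked_item_ids:
        vec += item_vecs[i]
    return vec

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

def top_n(profile, candidate_ids, item_vecs, n=10):
    """Rank candidate items by descending cosine similarity to the profile."""
    scored = [(i, cosine(profile, item_vecs[i])) for i in candidate_ids]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]
</syntaxhighlight>

With cosine similarity as the ranking function, the framework reduces to a nearest-neighbour search in the embedding space, which is what makes the comparison among the three techniques straightforward.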
'''Overview of the techniques.''' Latent Semantic Indexing (LSI) [1] is a word embedding technique which applies Singular Value Decomposition (SVD) over a word-document matrix. The goal of the approach is to compress the original information space through SVD in order to obtain a smaller-scale word-concept matrix, in which each column models a latent concept occurring in the original vector space. Specifically, SVD is employed to unveil the latent relationships between terms according to their usage in the corpus.

Next, Random Indexing (RI) [3] is an incremental technique that learns a low-dimensional word representation relying on the principles of Random Projection. It works in two steps: first, a context vector is defined for each context (the definition of the context is typically scenario-dependent; it may be a paragraph, a sentence, or the whole document). Each context vector is ternary (it contains values in {−1, 0, 1}), very sparse, and its values are randomly distributed. Given such context vectors, the vector-space representation of each word is obtained by simply summing the representations of all the contexts in which the word occurs. An important peculiarity of this approach is that it is incremental and scalable: if new documents come into play, the vector-space representation of the terms is updated by just adding the occurrences of the terms in the new documents.

Finally, Word2Vec (W2V) is a recent technique proposed by Mikolov et al. [2]. The approach learns a vector-space representation of the terms by exploiting a two-layer neural network. In the first step, the weights of the network are randomly initialized, as in RI. Next, the network is trained using the Skip-gram methodology in order to model fine-grained regularities in word usage. At each step, the weights are updated through Stochastic Gradient Descent, and a vector-space representation of each term is obtained by extracting the weights of the network at the end of the training.
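To make the two-step procedure concrete, the following is a minimal sketch of Random Indexing in which the whole document is used as the context. It is not the authors' implementation; the dimensionality, the number of non-zero entries and the function names are illustrative assumptions.

<syntaxhighlight lang="python">
# Minimal Random Indexing sketch, assuming each document is one context.
# dim and n_nonzero are illustrative parameters, not the paper's settings.
import numpy as np

def ternary_context_vector(dim=300, n_nonzero=10, rng=None):
    """Sparse random vector with a few +1/-1 entries, zeros elsewhere."""
    if rng is None:
        rng = np.random.default_rng()
    vec = np.zeros(dim)
    idx = rng.choice(dim, size=n_nonzero, replace=False)
    vec[idx] = rng.choice([-1.0, 1.0], size=n_nonzero)
    return vec

def random_indexing(documents, dim=300, n_nonzero=10, seed=42):
    """Word vector = sum of the context vectors of the contexts it occurs in."""
    rng = np.random.default_rng(seed)
    word_vectors = {}
    for doc in documents:  # incremental: new documents just add to the sums
        ctx = ternary_context_vector(dim, n_nonzero, rng)
        for word in doc:
            word_vectors.setdefault(word, np.zeros(dim))
            word_vectors[word] += ctx
    return word_vectors

# Usage: random_indexing([["the", "matrix", "film"], ["the", "book"]])
</syntaxhighlight>

Because the word vectors are just running sums of context vectors, processing additional documents only adds new terms to those sums, which is what makes the technique incremental and scalable.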
===2. EXPERIMENTAL EVALUATION===
In the experimental evaluation, the performance of the word embedding representations was compared on two state-of-the-art datasets, MovieLens (ML) and DBbook (DB) (http://challenges.2014.eswc-conferences.org/index.php/RecSys). Moreover, we also compared the effectiveness of the best-performing configurations to some widespread baselines.

'''Experimental Design.''' Experiments were performed by adopting different protocols: as regards ML, we carried out a 5-fold cross-validation, while a single training/test split was used for DB. Textual content was obtained by mapping items to Wikipedia pages. For each word embedding technique we compared two different sizes of the learned vectors: 300 and 500. As regards the baselines, we exploited the MyMediaLite library (http://www.mymedialite.net/). We evaluated User-to-User (U2U-KNN) and Item-to-Item (I2I-KNN) Collaborative Filtering, as well as Bayesian Personalized Ranking Matrix Factorization (BPRMF). The U2U and I2I neighborhood size was set to 80, while BPRMF was run with the factor parameter set to 100; in both cases we chose the optimal values for the parameters. Finally, statistical significance was assessed by exploiting the Wilcoxon and Friedman tests, chosen after running the Shapiro-Wilk test, which revealed the non-normal distribution of the data.
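The test-selection protocol (Shapiro-Wilk first, non-parametric tests afterwards) can be reproduced with SciPy along the following lines. This is only a sketch: the per-user F1 arrays are random placeholders, not the scores reported in Table 1.

<syntaxhighlight lang="python">
# Sketch of the significance-testing protocol: check normality first,
# then compare paired per-user scores with non-parametric tests.
# f1_w2v, f1_ri, f1_lsi are hypothetical arrays of per-user F1@N values.
import numpy as np
from scipy.stats import shapiro, wilcoxon, friedmanchisquare

rng = np.random.default_rng(0)
f1_w2v = rng.uniform(0.4, 0.7, size=100)  # placeholder scores
f1_ri = rng.uniform(0.4, 0.7, size=100)
f1_lsi = rng.uniform(0.4, 0.7, size=100)

# Shapiro-Wilk on the paired differences: a small p-value suggests
# non-normality, which motivates the non-parametric tests.
_, p_norm = shapiro(f1_w2v - f1_ri)

if p_norm < 0.05:
    _, p_pair = wilcoxon(f1_w2v, f1_ri)                   # pairwise comparison
    _, p_all = friedmanchisquare(f1_w2v, f1_ri, f1_lsi)   # all systems at once
    print(f"Wilcoxon p={p_pair:.4f}, Friedman p={p_all:.4f}")
</syntaxhighlight>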
Table 1: Results of the experiments. The best word embedding approach for each metric is highlighted in bold, the best overall configuration is in bold and underlined, and the baselines outperformed by at least one word embedding approach are in italics.

{| class="wikitable"
|+ MovieLens
! Metric !! W2V (300) !! W2V (500) !! RI (300) !! RI (500) !! LSI (300) !! LSI (500) !! U2U !! I2I !! BPRMF
|-
! F1@5
| '''0.5056''' || 0.5054 || 0.4921 || 0.4910 || 0.4645 || 0.4715 || '''<u>0.5217</u>''' || ''0.5022'' || 0.5141
|-
! F1@10
| '''0.5757''' || 0.5751 || 0.5622 || 0.5613 || 0.5393 || 0.5469 || '''<u>0.5969</u>''' || 0.5836 || 0.5928
|-
! F1@15
| 0.5672 || '''0.5674''' || 0.5349 || 0.5352 || 0.5187 || 0.5254 || '''<u>0.5911</u>''' || 0.5814 || 0.5876
|}

{| class="wikitable"
|+ DBbook
! Metric !! W2V (300) !! W2V (500) !! RI (300) !! RI (500) !! LSI (300) !! LSI (500) !! U2U !! I2I !! BPRMF
|-
! F1@5
| 0.5183 || '''0.5186''' || 0.5064 || 0.5039 || 0.5056 || 0.5076 || 0.5193 || ''0.5111'' || '''<u>0.5290</u>'''
|-
! F1@10
| 0.6207 || 0.6209 || 0.6239 || 0.6244 || 0.6256 || '''0.6260''' || ''0.6229'' || ''0.6194'' || '''<u>0.6263</u>'''
|-
! F1@15
| 0.5829 || 0.5828 || 0.5892 || 0.5887 || 0.5908 || '''<u>0.5909</u>''' || ''0.5777'' || ''0.5776'' || ''0.5778''
|}

'''Discussion of the results.''' The first six columns of Table 1 provide the results of the comparison among the word embedding techniques. As regards ML, W2V emerged as the best-performing configuration for all the metrics taken into account, and the gap is significant when compared to both RI and LSI. Moreover, the results show that the size of the vectors did not significantly affect the overall accuracy of the algorithms (with the exception of LSI). This is an interesting outcome, since word embeddings can obtain good results even with a smaller word representation. However, the outcomes emerging from these first experiments are controversial, since the DBbook data provided opposite results: on this dataset W2V is the best-performing configuration only for F1@5, while LSI, which performed the worst on MovieLens data, overcomes both W2V and RI on F1@10 and F1@15. At first sight these results indicate non-generalizable outcomes. However, it is likely that such behavior depends on specific peculiarities of the datasets, which in turn influence the way the approaches learn their vector-space representations. A more thorough analysis is needed to obtain general guidelines explaining the behavior of such approaches.

Next, we compared the word embedding techniques to the baselines described above. The results clearly show that the effectiveness of the word embedding approaches is directly dependent on the sparsity of the data. This is an expected behavior, since content-based approaches can better deal with cold-start situations. On a highly sparse dataset such as DBbook (99.13% sparsity, against 93.59% for MovieLens), the content-based approaches based on word embeddings tend to overcome the baselines. Indeed, all the approaches overcome I2I and U2U on F1@10 and F1@15 (W2V also overcomes I2I on F1@5). Furthermore, it is worth noting that on F1@10 and F1@15 word embeddings obtain results which are comparable (or even better, on F1@15) to those obtained by BPRMF. This is a very important outcome, which definitely confirms the effectiveness of such techniques. Conversely, on less sparse datasets such as MovieLens, CF algorithms overcome their content-based counterparts.

However, the overall outcomes emerging from this preliminary investigation are very promising: given that no specific NLP processing was performed on the data, it is likely that a more thorough processing of the content can lead to even better results. Thus, this investigation showed that word embedding approaches can represent a very interesting alternative to widespread CF approaches. In future work, we will further validate our results by also investigating the effectiveness of novel and richer textual data sources, such as those coming from the Linked Open Data cloud.

===3. REFERENCES===
[1] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41:391–407, 1990.

[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[3] Magnus Sahlgren. An introduction to Random Indexing. In Methods and Applications of Semantic Indexing Workshop, TKE 2005, 2005.