Word Embedding techniques for Content-based Recommender Systems: an empirical evaluation

Cataldo Musto, Giovanni Semeraro, Marco De Gemmis, Pasquale Lops
University of Bari Aldo Moro, Italy
{name.surname}@uniba.it

ABSTRACT
This work presents an empirical comparison among three widespread word embedding techniques: Latent Semantic Indexing, Random Indexing and the more recent Word2Vec. Specifically, we employed these techniques to learn a low-dimensional vector-space representation of words, and we exploited it to represent both items and user profiles in a content-based recommendation scenario. The performance of the techniques was evaluated on two state-of-the-art datasets, and the experimental results provide good insights which pave the way to several future directions.

1. MOTIVATIONS AND METHODOLOGY
Word embedding techniques learn, in a totally unsupervised way, a low-dimensional vector-space representation of words by analyzing their usage in (very) large corpora of textual documents. These approaches have recently been gaining more and more attention, since they have shown very good performance in a broad range of natural language processing scenarios, ranging from sentiment analysis and machine translation to more challenging tasks such as learning a textual description of a given image (http://googleresearch.blogspot.it/2014/11/a-picture-is-worth-thousand-coherent.html).

In a nutshell, all these techniques exploit a large corpus of documents to encode the co-occurrences between terms, in order to learn both linguistic regularities and semantic nuances according to word usage. Next, given this huge co-occurrence matrix, each technique uses a different approach to obtain a smaller, low-dimensional representation of each word occurring in the original corpus. An important feature common to all these techniques is that the dimension of the representation (that is to say, the size of the vectors) is just a parameter of the model, so it can be set according to specific constraints or peculiarities of the data.

However, although the effectiveness of such techniques (especially when combined with deep neural network architectures) is already taken for granted, only a few works have investigated how well they perform in recommender systems-related tasks. To this aim, in this work we defined a very simple content-based recommendation framework based on word embeddings, in order to assess the effectiveness of such techniques in this scenario as well. Specifically, we first exploited word embedding techniques to represent words in vector spaces. Next, we inferred a vector-space representation of each item by summing the representations of the words occurring in the document describing it. Similarly, user profiles are represented by summing the representations of the items the user liked. Finally, by exploiting classic similarity measures, the available items can be ranked according to their descending similarity with respect to the user profile, and recommendations can be provided in a typical Top-N recommendation setting.
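This pipeline can be sketched in a few lines of Python. The snippet below is only a minimal illustration under our own assumptions, not the implementation evaluated in this paper: it presumes that a word_vectors dictionary mapping each word to a NumPy array (learned by any of the techniques described below) is already available, and all function names are hypothetical.

import numpy as np

def item_vector(words, word_vectors, dim=300):
    # Item representation: sum of the vectors of the words in its description.
    v = np.zeros(dim)
    for w in words:
        if w in word_vectors:
            v += word_vectors[w]
    return v

def profile_vector(liked_items, item_vectors, dim=300):
    # User profile: sum of the vectors of the items the user liked.
    v = np.zeros(dim)
    for i in liked_items:
        v += item_vectors[i]
    return v

def top_n(profile, item_vectors, n=10):
    # Rank candidate items by descending cosine similarity with the profile.
    def cosine(a, b):
        norm = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a.dot(b) / norm) if norm > 0 else 0.0
    scores = {i: cosine(profile, v) for i, v in item_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

In a real run, the items already rated by the user would of course be filtered out of the returned ranking.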
Clearly, this is a very basic formulation, since more fine-grained representations can be learned for both items and user profiles. However, this work just aims to preliminarily evaluate the effectiveness of such representations in a simplified recommendation framework, in order to pave the way to several future research directions in the area.

Overview of the techniques. Latent Semantic Indexing (LSI) [1] is a word embedding technique which applies Singular Value Decomposition (SVD) over a word-document matrix. The goal of the approach is to compress the original information space through SVD in order to obtain a smaller-scale word-concept matrix, in which each column models a latent concept occurring in the original vector space. Specifically, SVD is employed to unveil the latent relationships between terms according to their usage in the corpus.

Next, Random Indexing (RI) [3] is an incremental technique to learn a low-dimensional word representation, relying on the principles of Random Projection. It works in two steps: first, a context vector is defined for each context (the definition of the context is typically scenario-dependent; it may be a paragraph, a sentence or the whole document). Each context vector is ternary (it contains values in {−1, 0, 1}), very sparse, and its values are randomly distributed. Given such context vectors, the vector-space representation of each word is obtained by simply summing the representations of all the contexts in which the word occurs. An important peculiarity of this approach is that it is incremental and scalable: if new documents come into play, the vector-space representation of the terms is updated by just adding the occurrences of the terms in the new documents.
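To make the mechanics of RI concrete, the following sketch (again, an illustration under our own assumptions rather than the implementation evaluated here) uses each document as a context, assigns it a sparse ternary random vector, and incrementally sums these vectors into the representation of every word it contains. The dimensionality and the number of non-zero entries are hypothetical choices.

import numpy as np

def random_context_vector(dim=300, nonzero=10, rng=np.random.default_rng()):
    # Sparse ternary vector: a few randomly placed +1/-1 entries, zeros elsewhere.
    v = np.zeros(dim)
    idx = rng.choice(dim, size=nonzero, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=nonzero)
    return v

def random_indexing(documents, dim=300, nonzero=10):
    # documents: list of token lists; each document acts as a context.
    word_vectors = {}
    for doc in documents:
        context = random_context_vector(dim, nonzero)
        for w in set(doc):
            word_vectors.setdefault(w, np.zeros(dim))
            word_vectors[w] += context  # incremental update: new documents just add up
    return word_vectors

Because the word vectors are built by pure summation, newly arrived documents can be processed later with the same loop, which is exactly what makes the approach incremental.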
Finally, Word2Vec (W2V) is a recent technique proposed by Mikolov et al. [2]. The approach learns a vector-space representation of the terms by exploiting a two-layer neural network. In the first step, the weights of the network are randomly initialized, as in RI. Next, the network is trained using the Skip-gram methodology, in order to model fine-grained regularities in word usage. At each step, the weights are updated through Stochastic Gradient Descent, and the vector-space representation of each term is obtained by extracting the weights of the network at the end of the training.
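As a purely illustrative example of how Skip-gram vectors of this kind can be obtained in practice, the sketch below uses the gensim library; this is an assumption about tooling on our part, not the setup used in this paper, and the parameter names follow gensim 4.x.

from gensim.models import Word2Vec  # assumption: gensim 4.x is available

# documents: list of token lists, e.g. one tokenized Wikipedia page per item
documents = [["word", "embedding", "techniques"],
             ["content", "based", "recommender", "systems"]]

model = Word2Vec(
    sentences=documents,
    vector_size=300,  # size of the learned vectors (300 or 500 in our experiments)
    sg=1,             # 1 selects the Skip-gram architecture (0 would select CBOW)
    window=5,         # context window size (hypothetical choice)
    min_count=1,
    epochs=5,
)

word_vectors = {w: model.wv[w] for w in model.wv.index_to_key}

The resulting word_vectors dictionary can be plugged directly into the item and profile construction sketched earlier in this section.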
2. EXPERIMENTAL EVALUATION
In the experimental evaluation, the performance of the word embedding representations was compared on two state-of-the-art datasets, MovieLens (ML) and DBbook (DB) (http://challenges.2014.eswc-conferences.org/index.php/RecSys). Moreover, we also compared the effectiveness of the best-performing configurations to some widespread baselines.

Experimental Design. Experiments were performed by adopting different protocols: as regards ML, we carried out a 5-fold cross validation, while a single training/test split was used for DB. Textual content was obtained by mapping items to Wikipedia pages. For each word embedding technique we compared two different sizes of the learned vectors: 300 and 500. As regards the baselines, we exploited the MyMediaLite library (http://www.mymedialite.net/). We evaluated User-to-User (U2U-KNN) and Item-to-Item (I2I-KNN) Collaborative Filtering, as well as Bayesian Personalized Ranking Matrix Factorization (BPRMF). The U2U and I2I neighborhood size was set to 80, while BPRMF was run by setting the factor parameter equal to 100; in both cases we chose the optimal values for the parameters. Finally, statistical significance was assessed by exploiting the Wilcoxon and Friedman tests, chosen after running the Shapiro-Wilk test, which revealed the non-normal distribution of the data.

Table 1: Results of the experiments. For each row, * marks the best word embedding configuration and ** marks the best overall configuration.

MovieLens   W2V (300)  W2V (500)  RI (300)   RI (500)   LSI (300)  LSI (500)  U2U        I2I       BPRMF
F1@5        0.5056*    0.5054     0.4921     0.4910     0.4645     0.4715     0.5217**   0.5022    0.5141
F1@10       0.5757*    0.5751     0.5622     0.5613     0.5393     0.5469     0.5969**   0.5836    0.5928
F1@15       0.5672     0.5674*    0.5349     0.5352     0.5187     0.5254     0.5911**   0.5814    0.5876

DBbook      W2V (300)  W2V (500)  RI (300)   RI (500)   LSI (300)  LSI (500)  U2U        I2I       BPRMF
F1@5        0.5183     0.5186*    0.5064     0.5039     0.5056     0.5076     0.5193     0.5111    0.5290**
F1@10       0.6207     0.6209     0.6239     0.6244     0.6256     0.6260*    0.6229     0.6194    0.6263**
F1@15       0.5829     0.5828     0.5892     0.5887     0.5908     0.5909**   0.5777     0.5776    0.5778

Discussion of the results. The first six columns of Table 1 report the comparison among the word embedding techniques. As regards ML, W2V emerged as the best-performing configuration for all the metrics taken into account, and the gap is significant when compared to both RI and LSI. Moreover, the results show that the size of the vectors did not significantly affect the overall accuracy of the algorithms (with the exception of LSI). This is an interesting outcome, since even with a smaller word representation, word embeddings can obtain good results. However, the outcomes emerging from these first experiments are not clear-cut, since the DBbook data provided the opposite results: on this dataset W2V is the best-performing configuration only for F1@5, while LSI, which performed the worst on the MovieLens data, overcomes both W2V and RI on F1@10 and F1@15. At first sight, these results do not generalize. However, it is likely that such behavior depends on specific peculiarities of the datasets, which in turn influence the way the approaches learn their vector-space representations. A more thorough analysis is needed to obtain general guidelines explaining the behavior of such approaches.

Next, we compared our techniques to the above-described baselines. The results clearly show that the effectiveness of the word embedding approaches is directly dependent on the sparsity of the data. This is an expected behavior, since content-based approaches can better deal with cold-start situations. On a highly sparse dataset such as DBbook (99.13% sparsity, against 93.59% for MovieLens), the content-based approaches based on word embeddings tend to overcome the baselines. Indeed, all the approaches overcome I2I and U2U on F1@10 and F1@15 (W2V also overcomes I2I on F1@5). Furthermore, it is worth noting that on F1@10 and F1@15 word embeddings obtain results which are comparable (or even better, on F1@15) to those obtained by BPRMF. This is a very important outcome, which definitely confirms the effectiveness of such techniques. Conversely, on less sparse datasets such as MovieLens, the CF algorithms overcome their content-based counterparts.

However, the overall outcomes emerging from this preliminary investigation are very promising: given that no specific NLP processing was performed on the data, it is likely that a more thorough processing of the content can lead to even better results. Thus, this investigation showed that word embedding approaches can represent a very interesting alternative to widespread CF approaches. In future work, we will further validate our results by also investigating the effectiveness of novel and richer textual data sources, such as those coming from the Linked Open Data cloud.

3. REFERENCES
[1] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. JASIS, 41:391–407, 1990.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[3] Marcus Sahlgren. An introduction to Random Indexing. In Methods and Applications of Semantic Indexing Workshop, TKE 2005, 2005.