=Paper=
{{Paper
|id=Vol-1448/paper4
|storemode=property
|title=Metadata Embeddings for User and Item Cold-start Recommendations
|pdfUrl=https://ceur-ws.org/Vol-1448/paper4.pdf
|volume=Vol-1448
|dblpUrl=https://dblp.org/rec/conf/recsys/Kula15
}}
==Metadata Embeddings for User and Item Cold-start Recommendations==
Maciej Kula, Lyst, maciej.kula@lyst.com

CBRecSys 2015, September 20, 2015, Vienna, Austria. Copyright remains with the authors and/or original copyright holders.

===ABSTRACT===
I present a hybrid matrix factorisation model representing users and items as linear combinations of their content features’ latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant. Additionally, feature embeddings produced by the model encode semantic information in a way reminiscent of word embedding approaches, making them useful for a range of related tasks such as tag recommendations.

===Categories and Subject Descriptors===
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering

===Keywords===
Recommender Systems, Cold-start, Matrix Factorization

===1. INTRODUCTION===
Building recommender systems that perform well in cold-start scenarios (where little data is available on new users and items) remains a challenge. The standard matrix factorisation (MF) model performs poorly in that setting: it is difficult to effectively estimate user and item latent factors when collaborative interaction data is sparse.

Content-based (CB) methods address this by representing items through their metadata [10]. As these are known in advance, recommendations can be computed even for new items for which no collaborative data has been gathered. Unfortunately, no transfer learning occurs in CB models: models for each user are estimated in isolation and do not benefit from data on other users. Consequently, CB models perform worse than MF models where collaborative information is available and require a large amount of data on each user, rendering them unsuitable for user cold-start [1].

At Lyst, solving these problems is crucial. We are a fashion company aiming to provide our users with a convenient and engaging way to browse (and shop) for fashion online. To that end we maintain a very large product catalogue: at the time of writing, we aggregate over 8 million fashion items from across the web, adding tens of thousands of new products every day.

Three factors conspire to make recommendations challenging for us. Firstly, our system contains a very large number of items. This makes our data very sparse. Secondly, we deal in fashion: often, the most relevant items are those from newly released collections, allowing us only a short window to gather data and provide effective recommendations. Finally, a large proportion of our users are first-time visitors: we would like to present them with compelling recommendations even with little data. This combination of user and item cold-start makes both pure collaborative and content-based methods unsuitable for us.

To solve this problem, I use a hybrid content-collaborative model, called LightFM due to its resemblance to factorisation machines (see Section 3). In LightFM, as in a collaborative filtering model, users and items are represented as latent vectors (embeddings). However, just as in a CB model, these are entirely defined by functions (in this case, linear combinations) of embeddings of the content features that describe each product or user. For example, if the movie ‘Wizard of Oz’ is described by the following features: ‘musical fantasy’, ‘Judy Garland’, and ‘Wizard of Oz’, then its latent representation will be given by the sum of these features’ latent representations.

In doing so, LightFM unites the advantages of content-based and collaborative recommenders. In this paper, I formalise the model and present empirical results on two datasets, showing that:

1. In both cold-start and low density scenarios, LightFM performs at least as well as pure content-based models, substantially outperforming them when either (1) collaborative information is available in the training set or (2) user features are included in the model.

2. When collaborative data is abundant (warm-start, dense user-item matrix), LightFM performs at least as well as the MF model.

3. Embeddings produced by LightFM encode important semantic information about features, and can be used for related recommendation tasks such as tag recommendations.

This has several benefits for real-world recommender systems. Because LightFM works well on both dense and sparse data, it obviates the need for building and maintaining multiple specialised machine learning models for each setting. Additionally, as it can use both user and item metadata, it is applicable in both item and user cold-start scenarios.

To allow others to reproduce the results in this paper, I have released a Python implementation of LightFM (https://github.com/lyst/lightfm/), and made the source code for this paper and all the experiments available on GitHub (https://github.com/lyst/lightfm-paper/).

===2. LIGHTFM===

====2.1 Motivation====
The structure of the LightFM model is motivated by two considerations.

1. The model must be able to learn user and item representations from interaction data: if items described as ‘ball gown’ and ‘pencil skirt’ are consistently liked by the same users, the model must learn that ball gowns are similar to pencil skirts.

2. The model must be able to compute recommendations for new items and users.

I fulfil the first requirement by using the latent representation approach. If ball gowns and pencil skirts are both liked by the same users, their embeddings will be close together; if ball gowns and biker jackets are never liked by the same users, their embeddings will be far apart.

Such representations allow transfer learning to occur. If the representations for ball gowns and pencil skirts are similar, we can confidently recommend ball gowns to a new user who has so far only interacted with pencil skirts.

This is over and above what pure CB models using dimensionality reduction techniques (such as latent semantic indexing, LSI) can achieve, as these only encode information given by feature co-occurrence rather than user actions. For example, suppose that all users who look at items described as aviators also look at items described as wayfarers, but the two features never describe the same item. In this case, the LSI vector for wayfarers will not be similar to the one for aviators even though collaborative information suggests it should be.

I fulfil the second requirement by representing items and users as linear combinations of their content features. Because content features are known the moment a user or item enters the system, this allows recommendations to be made straight away. The resulting structure is also easy to understand. The representation for denim jacket is simply a sum of the representation of denim and the representation of jacket; the representation for a female user from the US is a sum of the representations of US and female users.

====2.2 The Model====
To describe the model formally, let <math>U</math> be the set of users, <math>I</math> be the set of items, <math>F^U</math> be the set of user features, and <math>F^I</math> the set of item features. Each user interacts with a number of items, either in a favourable way (a positive interaction), or in an unfavourable way (a negative interaction). The set of all user-item interaction pairs <math>(u, i) \in U \times I</math> is the union of both positive interactions <math>S^+</math> and negative interactions <math>S^-</math>.

Users and items are fully described by their features. Each user <math>u</math> is described by a set of features <math>f_u \subset F^U</math>. The same holds for each item <math>i</math>, whose features are given by <math>f_i \subset F^I</math>. The features are known in advance and represent user and item metadata.

The model is parameterised in terms of <math>d</math>-dimensional user and item feature embeddings <math>e^U_f</math> and <math>e^I_f</math> for each feature <math>f</math>. Each feature is also described by a scalar bias term (<math>b^U_f</math> for user and <math>b^I_f</math> for item features).

The latent representation of user <math>u</math> is given by the sum of its features’ latent vectors:

<math>q_u = \sum_{j \in f_u} e^U_j</math>

The same holds for item <math>i</math>:

<math>p_i = \sum_{j \in f_i} e^I_j</math>

The bias term for user <math>u</math> is given by the sum of the features’ biases:

<math>b_u = \sum_{j \in f_u} b^U_j</math>

The same holds for item <math>i</math>:

<math>b_i = \sum_{j \in f_i} b^I_j</math>

The model’s prediction for user <math>u</math> and item <math>i</math> is then given by the dot product of user and item representations, adjusted by user and item feature biases:

<math>\hat{r}_{ui} = f(q_u \cdot p_i + b_u + b_i) \qquad (1)</math>

A number of functions are suitable for <math>f(\cdot)</math>. An identity function would work well for predicting ratings; in this paper, I am interested in predicting binary data, and so, following Rendle et al. [16], I choose the sigmoid function

<math>f(x) = \frac{1}{1 + \exp(-x)}.</math>

The optimisation objective for the model consists in maximising the likelihood of the data conditional on the parameters. The likelihood is given by

<math>L(e^U, e^I, b^U, b^I) = \prod_{(u,i) \in S^+} \hat{r}_{ui} \times \prod_{(u,i) \in S^-} (1 - \hat{r}_{ui}) \qquad (2)</math>

I train the model using asynchronous stochastic gradient descent [14]. I use four training threads for experiments performed in this paper. The per-parameter learning rate schedule is given by Adagrad [6].

====2.3 Relationship to Other Models====
The relationship between LightFM and the collaborative MF model is governed by the structure of the user and item feature sets. If the feature sets consist solely of indicator variables for each user and item, LightFM reduces to the standard MF model. If the feature sets also contain metadata features shared by more than one item or user, LightFM extends the MF model by letting the feature latent factors explain part of the structure of user interactions.

This is important on three counts.

1. In most applications there will be fewer metadata features than there are users or items, either because an ontology with a fixed type/category structure is used, or because a fixed-size dictionary of most common terms is maintained when using raw textual features. This means that fewer parameters need to be estimated from limited training data, reducing the risk of overfitting and improving generalisation performance.

2. Latent vectors for indicator variables cannot be estimated for new, cold-start users or items. Representing these as combinations of metadata features that can be estimated from the training set makes it possible to make cold-start predictions.

3. If only indicator features are present, LightFM should perform on par with the standard MF model.

When only metadata features and no indicator variables are present, the model in general does not reduce to a pure content-based system. LightFM estimates feature embeddings by factorising the collaborative interaction matrix; this is unlike content-based systems which (when dimensionality reduction is used) factorise pure content co-occurrence matrices.

One special case where LightFM does reduce to a pure CB model is where each user is described by an indicator variable and has interacted only with one item. In that setting, the user vector is equivalent to a document vector in the LSI formulation, and only features which occur together in product descriptions will have similar embeddings.

The fact that LightFM contains both the pure CB model at the sparse data end of the spectrum and the MF model at the dense end suggests that it should adapt well to datasets of varying sparsity. In fact, empirical results show that it performs at least as well as the appropriate specialised model in each scenario.

===3. RELATED WORK===
There are a number of related hybrid models attempting to solve the cold-start problem by jointly modelling content and collaborative data.

Soboroff et al. [21] represent users as linear combinations of the feature vectors of items they have interacted with. They then perform LSI on the resulting user-feature matrix to obtain latent user profiles. Representations of new items are obtained by projecting them onto the latent feature space. The advantage of the model, relative to pure CB approaches, consists in using collaborative information encoded in the user-feature matrix. However, it models user preferences as being defined over individual features themselves instead of over items (sets of features). This is unlike LightFM, where a feature’s effect in predicting an interaction is always taken in the context of all other features characterising a given user-item pair.

Saveski et al. [18] perform joint factorisation of the user-item and item-feature matrices by using the same item latent feature matrix in both decompositions; the parameters are optimised by minimising a weighted sum of both matrices’ reproduction loss functions. A weight hyperparameter governs the relative importance of accuracy in decomposing the collaborative and content matrices. A similar approach is used by McAuley et al. [11] for jointly modelling ratings and product reviews. Here, LightFM has the advantage of simplicity as its single optimisation objective is to factorise the user-item matrix.

Shmueli et al. [20] represent items as linear combinations of their features’ latent factors to recommend news articles; like LightFM, they use a single-objective approach and minimise the user-item matrix reproduction loss. They show their approach to be successful in a modified cold-start setting, where both metadata and data on other users who have commented on a given article are available. However, their approach does not extend to modelling user features and does not provide evidence on model performance in the warm-start scenario.

LightFM fits into the hybrid model tradition by jointly factorising the user-item, item-feature, and user-feature matrices. From a theory standpoint, it can be construed as a special case of Factorisation Machines [15].

FMs provide an efficient method of estimating variable interaction terms in linear models under sparsity. Each variable is represented by a <math>k</math>-dimensional latent factor; the interaction between variables <math>i</math> and <math>j</math> is then given by the dot product of their latent factors. This has the advantage of reducing the number of parameters to be estimated.

LightFM further restricts the interaction structure by only estimating the interactions between user and item features. This aids the interpretability of resulting feature embeddings.

===4. DATASETS===
I evaluate LightFM’s performance on two datasets. The datasets span the range of dense interaction data, where MF models can be expected to perform well (MovieLens), and sparse data, where CB models tend to perform better (CrossValidated). Both datasets are freely available.

====4.1 MovieLens====
The first experiment uses the well-known MovieLens 10M dataset (http://grouplens.org/datasets/movielens/), combined with the Tag Genome tag set [22].

The dataset consists of approximately 10 million movie ratings, submitted by 71,567 users on 10,681 movies. All movies are described by their genres and a list of tags from the Tag Genome. Each movie-tag pair is accompanied by a relevance score (between 0 and 1), denoting how accurately a given tag describes the movie.

To binarise the problem, I treat all ratings below 4.0 (on a 1 to 5 scale) as negative; all ratings equal to or above 4.0 are positive. I also filter out all tags whose relevance score falls below 0.8, retaining only highly relevant tags.

The final dataset contains 69,878 users, 10,681 items, 9,996,948 interactions, and 1030 unique tags.

====4.2 CrossValidated====
The second dataset consists of questions and answers posted on CrossValidated (http://stats.stackexchange.com), a part of the larger network of StackExchange collaborative Q&A sites that focuses on statistics and machine learning. The dataset (https://archive.org/details/stackexchange) consists of 5953 users, 44,200 questions, and 188,865 answers and comments. Each question is accompanied by one or more of 1032 unique tags (such as ‘regression’ or ‘hypothesis-testing’). Additionally, user metadata is available in the form of ‘About Me’ sections on users’ profiles.

The recommendation goal is to match users with questions they can answer. A user answering a question is taken as an implicit positive signal; all questions that a user has not answered are treated as implicit negative signals. For both the training and test sets, I construct 3 negative pairs for each positive user-question pair by randomly sampling from all questions that a given user has not answered.

To keep the model simple, I focus on a user’s willingness to answer a question rather than their ability, and forego modelling user expertise [17].

===5. EXPERIMENTAL SETUP===
For each dataset, I perform two experiments. The first simulates a warm-start setting: 20% of all interaction pairs are randomly assigned to the test set, but all items and users are represented in the training set. The second is an item cold-start scenario: all interactions pertaining to 20% of items are removed from the training set and added to the test set. This approximates a setting where the recommender is required to make recommendations from a pool of items for which no collaborative information has been gathered, and only content metadata (tags) are available.

I measure model accuracy using the mean receiver operating characteristic area under the curve (ROC AUC) metric. For an individual user, AUC corresponds to the probability that a randomly chosen positive item will be ranked higher than a randomly chosen negative item. A high AUC score is equivalent to a low rank-inversion probability, where a rank inversion means the recommender mistakenly ranks an unattractive item higher than an attractive item. I compute this metric for all users in the test set and average it for the final score.

I compute the AUC metric by repeatedly randomly splitting the dataset into an 80% training set and a 20% test set. The final score is given by averaging across 10 repetitions.

I test the following models:

1. MF: a conventional matrix factorisation model with user and item biases and a sigmoid link function [8].

2. LSI-LR: a content-based model. To estimate it, I first derive latent topics from the item-feature matrix through latent semantic indexing and represent items as linear combinations of latent topics. I then fit a separate logistic regression (LR) model for each user in the topic mixture space. Unlike the LightFM model, which uses collaborative data to produce its latent representation, LSI-LR is purely based on factorising the content matrix. It should therefore be helpful in highlighting the benefit of using collaborative information for constructing feature embeddings.

3. LSI-UP: a hybrid model that represents user profiles (UP) as linear combinations of items’ content vectors, then applies LSI to the resulting matrix to obtain latent user and item representations ([21], see Section 3). I estimate this model by first constructing a user-feature matrix: each row represents a user and is given by the sum of content feature vectors representing the items that user positively interacted with. I then apply truncated SVD to the normalised matrix to obtain user and feature latent vectors; item latent vectors are obtained through projecting them onto the latent feature space. The recommendation score for a user-item pair is then the inner product of their latent representations.

4. LightFM (tags): the LightFM model using only tag features.

5. LightFM (tags + ids): the LightFM model using both tag and item indicator features.

6. LightFM (tags + about): the LightFM model using both item and user features. User features are available only for the CrossValidated dataset. I construct them by converting the ‘About Me’ sections of users’ profiles to a bag-of-words representation. I first strip them of all HTML tags and non-alphabetical characters, then convert the resulting string to lowercase and tokenise on spaces.

In both LightFM (tags) and LightFM (tags + ids) users are described only by indicator features.

I train the LightFM models using stochastic gradient descent with an initial learning rate of 0.05. The latent dimensionality of the models is set to 64 for all models and experiments. This setting is intended to reflect the balance between model accuracy and the computational cost of larger vectors in production systems (additional results on model sensitivity to this parameter are presented in Section 6.2). I regularise the model through an early-stopping criterion: the training is stopped when the model’s performance on the test set stops improving.

===6. EXPERIMENTAL RESULTS===

====6.1 Recommendation accuracy====
Experimental results are summarised in Table 1. LightFM performs very well, outperforming or matching the specialised model for each scenario.

{| class="wikitable"
|+ Table 1: Results (mean ROC AUC)
! Model !! CrossValidated, warm !! CrossValidated, cold !! MovieLens, warm !! MovieLens, cold
|-
| LSI-LR || 0.662 || 0.660 || 0.686 || 0.690
|-
| LSI-UP || 0.636 || 0.637 || 0.687 || 0.681
|-
| MF || 0.541 || 0.508 || 0.762 || 0.500
|-
| LightFM (tags) || 0.675 || 0.675 || 0.744 || 0.707
|-
| LightFM (tags + ids) || 0.682 || 0.674 || 0.763 || 0.716
|-
| LightFM (tags + about) || 0.695 || 0.696 || n/a || n/a
|}

In the warm-start, low-sparsity case (warm-start MovieLens), LightFM outperforms MF slightly when using both tag and item indicator features. This suggests that using metadata features may be valuable even when abundant interaction data is present.

Notably, LightFM (tags) almost matches MF performance despite using only metadata features.
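
The scores in Table 1 are mean per-user ROC AUC values, as defined in Section 5. As a rough illustration of that metric (not the code used for the experiments in this paper), the following Python sketch computes it from (user, score, label) triples. The function names, the half credit given to tied scores, and the decision to skip users who lack either a positive or a negative test item are assumptions made for this example.

```python
from collections import defaultdict


def user_auc(scores, labels):
    """Probability that a random positive item outranks a random negative one.

    Ties count as half a correct ordering (an assumption of this sketch).
    Returns None when the user has no positives or no negatives.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        return None
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def mean_user_auc(triples):
    """Average the per-user AUC over all users for whom it is defined."""
    per_user = defaultdict(lambda: ([], []))
    for user, score, label in triples:
        scores, labels = per_user[user]
        scores.append(score)
        labels.append(label)
    aucs = [user_auc(s, y) for s, y in per_user.values()]
    aucs = [a for a in aucs if a is not None]
    if not aucs:
        raise ValueError("no user has both positive and negative items")
    return sum(aucs) / len(aucs)


# Hypothetical toy data: (user id, predicted score, 1 = positive / 0 = negative).
toy = [
    ("a", 0.9, 1), ("a", 0.1, 0), ("a", 0.5, 0),  # perfectly ranked: AUC 1.0
    ("b", 0.2, 1), ("b", 0.8, 0),                 # inverted: AUC 0.0
    ("c", 0.3, 1),                                # no negatives: skipped
]
print(mean_user_auc(toy))  # 0.5
```

The pairwise loop is quadratic in the number of items per user; it is written this way only to mirror the rank-inversion definition directly, and a rank-based formulation would be preferable for large test sets.
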
The LSI-LR and LSI-UP models using the same information fare much worse. This demonstrates that (1) it is crucial to use collaborative information when estimating content feature embeddings, and (2) LightFM can capture that information much more accurately than other hybrid models such as LSI-UP.

In the warm-start, high-sparsity case (warm-start CrossValidated), MF performs very poorly. Because user interaction data is sparse (the CrossValidated user-item matrix is 99.95% sparse vs only 99% for the MovieLens dataset), MF is unable to learn good latent representations. Content-based models such as LSI-LR perform much better.

LightFM variants provide the best performance. LightFM (tags + about) is by far the best model, showing the added advantage of LightFM’s ability to integrate user metadata embeddings into the recommendation model. This is likely due to improved prediction performance for users with little data in the training set.

Results for the cold-start cases are broadly similar. On the CrossValidated dataset, all variants of LightFM outperform other models; LightFM (tags + about) again provides the best performance. Interestingly, LightFM (tags + ids) outperforms LightFM (tags) slightly on the MovieLens dataset, even though no embeddings can be estimated for movies in the test set. This suggests that using both metadata and per-movie features allows the model to estimate better embeddings for both, much like the use of user and item bias terms allows better latent factors to be computed. Unsurprisingly, MF performs no better than random in the cold-start case.

In all scenarios the LSI-UP model performs no better than the LSI-LR model, despite its attempt to incorporate collaborative data. On the CrossValidated dataset it performs strictly worse. This might be because its latent representations are estimated on less data than in LSI-LR: as there are fewer users than items in the dataset, there are fewer rows in the user-feature matrix than in the item-feature matrix.

The results confirm that LightFM encompasses both the MF and the LSI-LR model as special cases, performing better than the LSI-LR model in the sparse-data scenario and better than the MF model in the dense-data case. This means not only that a single model can be maintained in either setting, but also that the model will continue to perform well even when the sparsity structure of the data changes.

Good performance of LightFM (tags) in both datasets is predicated on the availability of high-quality metadata. Nevertheless, it is often possible to obtain good quality metadata from item descriptions (genres, actor lists and so on), expert or community tagging (Pandora [23], StackOverflow), or computer vision systems where image or audio data is available (we use image-based convolutional neural networks for product tagging). In fact, the feature embeddings produced by LightFM can themselves be used to assist the tagging process by suggesting related tags.

====6.2 Parameter Sensitivity====
Figure 1 plots the accuracy of LightFM, LSI-LR, and LSI-UP against values of the latent dimensionality hyperparameter <math>d</math> in the cold-start scenario (averaged over 30 runs of each algorithm). As <math>d</math> increases, each model is capable of modelling more complex structures and achieves better performance.

Interestingly, LightFM performs very well even with a small number of dimensions. In both datasets LightFM consistently outperforms other models, achieving high performance with as few as 16 dimensions. On CrossValidated data, it achieves the same performance as the LSI-LR model for much smaller <math>d</math>: it matches the accuracy of the 512-dimensional LSI-LR model even when using fewer than 32 dimensions.

This is an important win for large-scale recommender systems, where the choice of <math>d</math> is governed by a trade-off between vector size and recommendation accuracy. Since smaller vectors occupy less memory and use fewer computations during query time, better representational power at small <math>d</math> allows the system to achieve the same model performance at a smaller computational cost.

Figure 1: Latent dimension sensitivity. Each panel plots ROC AUC against the latent dimensionality <math>d</math> (4, 8, 16, 32, 64, 128, 256, 512); legend: LSI-LR, LSI-UP, LightFM (tags), LightFM (tags + ids), LightFM (tags + about). Panels: (a) CrossValidated, (b) MovieLens.

====6.3 Tag embeddings====
Feature embeddings generated by the LightFM model capture important information about the semantic relationships between different features. Table 2 gives some examples by listing groups of tags similar (in the cosine similarity sense) to a given query tag.

{| class="wikitable"
|+ Table 2: Tag similarity
! Query tag !! Similar tags
|-
| ‘regression’ || ‘least squares’, ‘multiple regression’, ‘regression coefficients’, ‘multicollinearity’
|-
| ‘MCMC’ || ‘BUGS’, ‘Metropolis-Hastings’, ‘Beta-Binomial’, ‘Gibbs’, ‘Bayesian’
|-
| ‘survival’ || ‘epidemiology’, ‘Cox model’, ‘Kaplan-Meier’, ‘hazard’
|-
| ‘art house’ || ‘pretentious’, ‘boring’, ‘graphic novel’, ‘pointless’, ‘weird’
|-
| ‘dystopia’ || ‘post-apocalyptic’, ‘futuristic’, ‘artificial intelligence’
|-
| ‘bond’ || ‘007’, ‘secret service’, ‘nuclear bomb’, ‘spying’, ‘assassin’
|}

In this respect, LightFM is similar to recent word embedding approaches like word2vec and GloVe [12, 13]. This is perhaps unsurprising, given that word embedding techniques are closely related to forms of matrix factorisation [9]. Nevertheless, LightFM and word embeddings differ in one important respect: whilst word2vec and GloVe embeddings are driven by textual corpus co-occurrence statistics, LightFM is based on user interaction data.

LightFM embeddings are useful for a number of recommendation tasks.

1. Tag recommendation. Various applications use collaborative tagging as a way of generating richer metadata for use in search and recommender systems [2, 7]. A tag recommender can enhance this process by either automatically applying matching tags, or generating suggested tag lists for approval by users. LightFM-produced tag embeddings will work well for this task without the need to build a separate specialised model for tag recommendations.

2. Genre or category recommendation. Many domains are characterised by an ontology of genres or categories which play an important role in the presentation of recommendations. For example, the Netflix interface is organised in genre rows; for Lyst, fashion designers, categories and subcategories are fundamental. The degree of similarity between the embeddings of genres or categories provides a ready basis for genre or category recommendations that respect the semantic structure of the ontology.

3. Recommendation justification. Rich information encoded in feature embeddings can help provide explanations for recommendations made by the system. For example, we might recommend a ball gown to a user who likes pencil skirts, and justify it by the two features’ similarity as revealed by the distance between their latent factors.

===7. USAGE IN PRODUCTION SYSTEMS===
The LightFM approach is motivated by our experience at Lyst. We have deployed LightFM in production, and successfully use it for a number of recommendation tasks. In this section, I describe some of the engineering and algorithm choices that make this possible.

====7.1 Model training and fold-in====
Thousands of new items and users appear on Lyst every day. To cope with this, we train our LightFM model in an online manner, continually updating the representations of existing features and creating fresh representations for features that we have never observed before.

We store model state, including feature embeddings and accumulated squared gradient information, in a database. When new data on user interaction arrives, we restore the model state and resume training, folding in any newly observed features. Since our implementation uses per-parameter diminishing learning rates (Adagrad), any updates of established features will be incremental as the model adapts to new data. For new features, a high learning rate is used to allow useful embeddings to be learned as quickly as possible.

No re-training is necessary for folding in new products: their representation can be immediately computed as the sum of the representations of their features.

====7.2 Feature engineering====
Each of our products is described by a set of textual features as well as structured metadata such as its type (dress, shoes and so on) or designer. These are accompanied by additional features coming from two sources.

Firstly, we employ a team of experienced fashion moderators, helping us to derive more fine-grained features such as clothing categories and subcategories (peplum dress, halter-neck and so on).

Secondly, we use machine learning systems for automatic feature detection. The most important of these is a set of deep convolutional neural networks deriving feature tags from product image data.

====7.3 Approximate nearest neighbour searches====
The biggest application of LightFM-derived item representations is related product recommendations: given a product, we would like to recommend other highly relevant products. To do this efficiently across 8 million products, we use a combination of approximate (for on-demand recommendations) and exact (for near-line computation) nearest neighbour search.

For approximate nearest neighbour (ANN) queries, we use Random Projection (RP) trees [4, 5]. RP trees are a variant of random-projection [3] based locality sensitive hashing (LSH).

In LSH, <math>k</math>-bit hash codes for each point <math>x</math> are generated by drawing <math>k</math> random hyperplanes <math>v_1, \ldots, v_k</math>, and then setting the <math>j</math>-th bit of the hash code to 1 if <math>x \cdot v_j \geq 0</math> and 0 otherwise. The approximate nearest neighbours of <math>x</math> are then other points that share the same hash code (or whose hash codes are within some small Hamming distance of each other).

While extremely fast, LSH has the undesirable property of sometimes producing a highly unbalanced distribution of points across hash codes: if points are densely concentrated, many hash codes will apply to no products while some will describe a very large number of points. This is unacceptable when building a production system, as it will lead to many queries being very slow.

RP trees provide much better guarantees about the size of leaf nodes: at each internal node, points are split based on the median distance to the chosen random hyperplane. This guarantees that at every split approximately half the points will be allocated to each child node, making the distribution of points (and query performance) much more predictable.

===8. CONCLUSIONS AND FUTURE WORK===
In this paper, I have presented an effective hybrid recommender model dubbed LightFM. I have shown the following:

1. LightFM performs at least as well as a specialised model across a wide range of collaborative data sparsity scenarios. It outperforms existing content-based and hybrid models in cold-start scenarios where collaborative data is abundant or where user metadata is available.

2. It produces high-quality content feature embeddings that capture important semantic information about the problem domain, and can be used for related tasks such as tag recommendations.

Both properties make LightFM an attractive model, applicable both in cold- and warm-start settings. Nevertheless, I see two promising directions in extending the current approach.

Firstly, the model can be easily extended to use more sophisticated training methodologies. For example, an optimisation scheme using Weighted Approximate-Rank Pairwise loss [24] or directly optimising mean reciprocal rank [19] could be used.

Secondly, there is no easy way of incorporating visual or audio features in the present formulation of LightFM. At Lyst, we use a two-step process to address this: we first use convolutional neural networks (CNNs) on image data to generate binary tags for all products, and then use the tags for generating recommendations. We conjecture that substantial improvements could be realised if the CNNs were trained with recommendation loss directly.

===9. REFERENCES===
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749, 2005.

[2] M. Bastian, M. Hayes, W. Vaughan, S. Shah, P. Skomoroch, H. Kim, S. Uryasev, and C. Lloyd. Linkedin skills: large-scale topic extraction and inference. In Proceedings of the 8th ACM Conference on Recommender systems, pages 1–8. ACM, 2014.

[3] S. Dasgupta. Experiments with random projection. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 143–151. Morgan Kaufmann Publishers Inc., 2000.

[5] S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. arXiv preprint arXiv:1302.1948, 2013.

[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[7] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in folksonomies. In Knowledge Discovery in Databases: PKDD 2007, pages 506–514. Springer, 2007.

[8] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[9] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.

[10] P. Lops, M. De Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender systems handbook, pages 73–105. Springer, 2011.

[11] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172. ACM, 2013.

[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[13] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.

[14] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.

[15] S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.

[16] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.

[17] J. San Pedro and A. Karatzoglou. Question recommendation for collaborative question answering systems with RankSLDA. In Proceedings of the 8th ACM Conference on Recommender systems, pages 193–200. ACM, 2014.

[18] M. Saveski and A. Mantrach. Item cold-start recommendations: learning local collective embeddings. In Proceedings of the 8th ACM Conference on Recommender systems, pages 89–96. ACM, 2014.

[19] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. 
Oliver, and A. Hanjalic. CLiMF: learning to [4] S. Dasgupta and Y. Freund. Random projection trees maximize reciprocal rank with collaborative and low dimensional manifolds. In Proceedings of the less-is-more filtering. In Proceedings of the 6th ACM fortieth annual ACM symposium on Theory of onference on Recommender systems, pages 139–146. computing, pages 537–546. ACM, 2008. ACM, 2012. [20] E. Shmueli, A. Kagian, Y. Koren, and R. Lempel. Care to Comment?: Recommendations for commenting on news stories. In Proceedings of the 21st international conference on World Wide Web, pages 429–438. ACM, 2012. [21] I. Soboroff and C. Nicholas. Combining content and collaboration in text filtering. In Proceedings of the IJCAI, volume 99, pages 86–91, 1999. [22] J. Vig, S. Sen, and J. Riedl. The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2, 2012. [23] T. Westergren. The music genome project. Online: http://pandora. com/mgp, 2007. [24] J. Weston, S. Bengio, and N. Usunier. WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, volume 11, pages 2764–2770, 2011.
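The fold-in behaviour described in Section 7.1 can be illustrated with a small sketch. It is not the production implementation: the `FeatureStore` class, its method names, the zero initialisation and the toy gradients are all assumptions made for illustration, and the sketch assumes that the high learning rate for new features arises simply because their Adagrad accumulators start empty.

```python
import math


class FeatureStore:
    """Hypothetical in-memory stand-in for the persisted model state."""

    def __init__(self, dim, learning_rate=0.05, eps=1e-6):
        self.dim = dim
        self.learning_rate = learning_rate
        self.eps = eps
        self.embeddings = {}    # feature -> embedding vector (zero-initialised here)
        self.sq_grad_sums = {}  # feature -> accumulated squared gradients

    def adagrad_step(self, feature, grad):
        """Apply one Adagrad update, creating state for unseen features."""
        vec = self.embeddings.setdefault(feature, [0.0] * self.dim)
        acc = self.sq_grad_sums.setdefault(feature, [0.0] * self.dim)
        for i, g in enumerate(grad):
            acc[i] += g * g
            # The effective rate shrinks as squared gradients accumulate.
            vec[i] -= self.learning_rate * g / (math.sqrt(acc[i]) + self.eps)

    def item_representation(self, features):
        """Fold in an item: sum the embeddings of its known features."""
        rep = [0.0] * self.dim
        for feature in features:
            known = self.embeddings.get(feature, [0.0] * self.dim)
            for i, value in enumerate(known):
                rep[i] += value
        return rep


store = FeatureStore(dim=2)

# An established feature that has already received many updates.
for _ in range(99):
    store.adagrad_step("dress", [1.0, -1.0])
before = list(store.embeddings["dress"])
store.adagrad_step("dress", [1.0, -1.0])
established_step = abs(store.embeddings["dress"][0] - before[0])

# A feature observed for the first time takes a much larger first step.
store.adagrad_step("peplum", [1.0, -1.0])
new_step = abs(store.embeddings["peplum"][0])

# A new product needs no training: its vector is the sum of its features.
new_item = store.item_representation(["dress", "peplum"])
```

In this toy setting the first update of the new feature is roughly ten times larger than the hundredth update of the established one, which matches the qualitative behaviour described in the text.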
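The contrast between LSH and RP trees drawn in Section 7.3 can be made concrete with a short sketch. It is a simplified illustration, not the system used at Lyst: the function names, the synthetic point cloud, the tree depth and the plain median split (without the refinements of the RP trees in [4, 5]) are all assumptions.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_direction(dim, rng):
    """Normal vector of a random hyperplane through the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def lsh_code(point, hyperplanes):
    """One bit per hyperplane: 1 if point . v >= 0, otherwise 0."""
    return tuple(1 if dot(point, v) >= 0 else 0 for v in hyperplanes)


def lsh_buckets(points, hyperplanes):
    """Group point indices by their hash code."""
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(lsh_code(point, hyperplanes), []).append(idx)
    return buckets


def rp_tree_leaves(points, indices, depth, rng):
    """Recursively split at the median projection onto a random direction."""
    if depth == 0 or len(indices) <= 1:
        return [indices]
    v = random_direction(len(points[0]), rng)
    ordered = sorted(indices, key=lambda i: dot(points[i], v))
    mid = len(ordered) // 2  # median split: half the points go to each branch
    return (rp_tree_leaves(points, ordered[:mid], depth - 1, rng)
            + rp_tree_leaves(points, ordered[mid:], depth - 1, rng))


rng = random.Random(0)
dim = 8
# Densely concentrated points: a tight cluster away from the origin.
points = [[5.0 + rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(1024)]

# 4-bit LSH codes give at most 16 buckets, often far fewer for such data.
hyperplanes = [random_direction(dim, rng) for _ in range(4)]
lsh_sizes = sorted(len(b) for b in lsh_buckets(points, hyperplanes).values())

# A depth-4 tree with median splits gives 16 leaves of 64 points each.
leaves = rp_tree_leaves(points, list(range(len(points))), 4, rng)
leaf_sizes = sorted(len(leaf) for leaf in leaves)

print("LSH bucket sizes:", lsh_sizes)
print("RP tree leaf sizes:", leaf_sizes)
```

Because every split is made at the median, the leaf sizes are fixed by the number of points and the depth alone, whereas the LSH bucket sizes depend on where the random hyperplanes happen to fall relative to the data.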