Metadata Embeddings for User and Item Cold-start Recommendations

Maciej Kula
Lyst
maciej.kula@lyst.com


ABSTRACT
   I present a hybrid matrix factorisation model representing users and items as linear combinations of their content features' latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant. Additionally, feature embeddings produced by the model encode semantic information in a way reminiscent of word embedding approaches, making them useful for a range of related tasks such as tag recommendations.

Categories and Subject Descriptors
   H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information Filtering

Keywords
   Recommender Systems, Cold-start, Matrix Factorization

CBRecSys 2015, September 20, 2015, Vienna, Austria.
Copyright remains with the authors and/or original copyright holders.

1.   INTRODUCTION
   Building recommender systems that perform well in cold-start scenarios (where little data is available on new users and items) remains a challenge. The standard matrix factorisation (MF) model performs poorly in that setting: it is difficult to effectively estimate user and item latent factors when collaborative interaction data is sparse.
   Content-based (CB) methods address this by representing items through their metadata [10]. As these are known in advance, recommendations can be computed even for new items for which no collaborative data has been gathered. Unfortunately, no transfer learning occurs in CB models: models for each user are estimated in isolation and do not benefit from data on other users. Consequently, CB models perform worse than MF models where collaborative information is available and require a large amount of data on each user, rendering them unsuitable for user cold-start [1].
   At Lyst, solving these problems is crucial. We are a fashion company aiming to provide our users with a convenient and engaging way to browse—and shop—for fashion online. To that end we maintain a very large product catalogue: at the time of writing, we aggregate over 8 million fashion items from across the web, adding tens of thousands of new products every day.
   Three factors conspire to make recommendations challenging for us. Firstly, our system contains a very large number of items. This makes our data very sparse. Secondly, we deal in fashion: often, the most relevant items are those from newly released collections, allowing us only a short window to gather data and provide effective recommendations. Finally, a large proportion of our users are first-time visitors: we would like to present them with compelling recommendations even with little data. This combination of user and item cold-start makes both pure collaborative and content-based methods unsuitable for us.
   To solve this problem, I use a hybrid content-collaborative model, called LightFM due to its resemblance to factorisation machines (see Section 3). In LightFM, like in a collaborative filtering model, users and items are represented as latent vectors (embeddings). However, just as in a CB model, these are entirely defined by functions (in this case, linear combinations) of embeddings of the content features that describe each product or user. For example, if the movie 'Wizard of Oz' is described by the following features: 'musical fantasy', 'Judy Garland', and 'Wizard of Oz', then its latent representation will be given by the sum of these features' latent representations.
   In doing so, LightFM unites the advantages of content-based and collaborative recommenders. In this paper, I formalise the model and present empirical results on two datasets, showing that:

  1. In both cold-start and low density scenarios, LightFM performs at least as well as pure content-based models, substantially outperforming them when either (1) collaborative information is available in the training set or (2) user features are included in the model.

  2. When collaborative data is abundant (warm-start, dense user-item matrix), LightFM performs at least as well as the MF model.

  3. Embeddings produced by LightFM encode important semantic information about features, and can be used for related recommendation tasks such as tag recommendations.
   This has several benefits for real-world recommender systems. Because LightFM works well on both dense and sparse data, it obviates the need for building and maintaining multiple specialised machine learning models for each setting. Additionally, as it can use both user and item metadata, it is applicable in both item and user cold-start scenarios.
   To allow others to reproduce the results in this paper, I have released a Python implementation of LightFM (https://github.com/lyst/lightfm/), and made the source code for this paper and all the experiments available on Github (https://github.com/lyst/lightfm-paper/).

2.   LIGHTFM

2.1   Motivation
   The structure of the LightFM model is motivated by two considerations.

  1. The model must be able to learn user and item representations from interaction data: if items described as 'ball gown' and 'pencil skirt' are consistently all liked by users, the model must learn that ball gowns are similar to pencil skirts.

  2. The model must be able to compute recommendations for new items and users.

   I fulfil the first requirement by using the latent representation approach. If ball gowns and pencil skirts are both liked by the same users, their embeddings will be close together; if ball gowns and biker jackets are never liked by the same users, their embeddings will be far apart.
   Such representations allow transfer learning to occur. If the representations for ball gowns and pencil skirts are similar, we can confidently recommend ball gowns to a new user who has so far only interacted with pencil skirts.
   This is over and above what pure CB models using dimensionality reduction techniques (such as latent semantic indexing, LSI) can achieve, as these only encode information given by feature co-occurrence rather than user actions. For example, suppose that all users who look at items described as aviators also look at items described as wayfarers, but the two features never describe the same item. In this case, the LSI vector for wayfarers will not be similar to the one for aviators even though collaborative information suggests it should be.
   I fulfil the second requirement by representing items and users as linear combinations of their content features. Because content features are known the moment a user or item enters the system, this allows recommendations to be made straight away. The resulting structure is also easy to understand. The representation for denim jacket is simply a sum of the representation of denim and the representation of jacket; the representation for a female user from the US is a sum of the representations of US and female users.

2.2   The Model
   To describe the model formally, let U be the set of users, I be the set of items, F^U be the set of user features, and F^I the set of item features. Each user interacts with a number of items, either in a favourable way (a positive interaction), or in an unfavourable way (a negative interaction). The set of all user-item interaction pairs (u, i) ∈ U × I is the union of both positive S^+ and negative interactions S^−.
   Users and items are fully described by their features. Each user u is described by a set of features f_u ⊂ F^U. The same holds for each item i whose features are given by f_i ⊂ F^I. The features are known in advance and represent user and item metadata.
   The model is parameterised in terms of d-dimensional user and item feature embeddings e^U_f and e^I_f for each feature f. Each feature is also described by a scalar bias term (b^U_f for user and b^I_f for item features).
   The latent representation of user u is given by the sum of its features' latent vectors:

    q_u = \sum_{j \in f_u} e^U_j

The same holds for item i:

    p_i = \sum_{j \in f_i} e^I_j

The bias term for user u is given by the sum of the features' biases:

    b_u = \sum_{j \in f_u} b^U_j

The same holds for item i:

    b_i = \sum_{j \in f_i} b^I_j

   The model's prediction for user u and item i is then given by the dot product of user and item representations, adjusted by user and item feature biases:

    \hat{r}_{ui} = f(q_u \cdot p_i + b_u + b_i)                    (1)

There are a number of functions suitable for f(·). An identity function would work well for predicting ratings; in this paper, I am interested in predicting binary data, and so after Rendle et al. [16] I choose the sigmoid function

    f(x) = \frac{1}{1 + \exp(-x)}.

   The optimisation objective for the model consists in maximising the likelihood of the data conditional on the parameters. The likelihood is given by

    L(e^U, e^I, b^U, b^I) = \prod_{(u,i) \in S^+} \hat{r}_{ui} \times \prod_{(u,i) \in S^-} (1 - \hat{r}_{ui})                    (2)

   I train the model using asynchronous stochastic gradient descent [14]. I use four training threads for experiments performed in this paper. The per-parameter learning rate schedule is given by Adagrad [6].
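   To make the scoring rule concrete, the following is a minimal NumPy sketch of Equation (1). It assumes binary feature indicator vectors and already-learned embedding matrices; the names are illustrative and do not correspond to the released implementation's API.

    import numpy as np

    def lightfm_score(user_features, item_features,
                      user_embeddings, item_embeddings,
                      user_biases, item_biases):
        """Sigmoid score for one user-item pair (Equation 1).

        user_features, item_features: binary feature indicator vectors.
        user_embeddings, item_embeddings: (n_features, d) embedding matrices.
        user_biases, item_biases: per-feature scalar biases.
        """
        # Latent representations: sums of the active features' embeddings.
        q_u = user_embeddings[user_features > 0].sum(axis=0)
        p_i = item_embeddings[item_features > 0].sum(axis=0)

        # Bias terms: sums of the active features' biases.
        b_u = user_biases[user_features > 0].sum()
        b_i = item_biases[item_features > 0].sum()

        # Sigmoid link, appropriate for binary interaction data.
        return 1.0 / (1.0 + np.exp(-(q_u @ p_i + b_u + b_i)))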
2.3   Relationship to Other Models
   The relationship between LightFM and the collaborative MF model is governed by the structure of the user and item feature sets. If the feature sets consist solely of indicator variables for each user and item, LightFM reduces to the standard MF model. If the feature sets also contain metadata features shared by more than one item or user, LightFM extends the MF model by letting the feature latent factors explain part of the structure of user interactions.
   This is important on three counts.
  1. In most applications there will be fewer metadata features than there are users or items, either because an ontology with a fixed type/category structure is used, or because a fixed-size dictionary of most common terms is maintained when using raw textual features. This means that fewer parameters need to be estimated from limited training data, reducing the risk of overfitting and improving generalisation performance.

  2. Latent vectors for indicator variables cannot be estimated for new, cold-start users or items. Representing these as combinations of metadata features that can be estimated from the training set makes it possible to make cold-start predictions.

  3. If only indicator features are present, LightFM should perform on par with the standard MF model.
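   As an illustration of this distinction, the sketch below builds two item-feature matrices with scipy.sparse: one containing only per-item indicator columns (the MF special case) and one that appends shared tag columns (the hybrid case). The variable names and tag vocabulary are made up for the example and need not match how the released implementation constructs its inputs.

    import numpy as np
    import scipy.sparse as sp

    n_items = 4
    tag_vocabulary = ['dress', 'denim', 'jacket']  # hypothetical metadata features

    # Indicator-only features: one column per item. With these features alone,
    # each item has its own latent vector and LightFM coincides with standard MF.
    indicator_features = sp.identity(n_items, format='csr')

    # Tag memberships: rows are items, columns are shared metadata features.
    tag_features = sp.csr_matrix(np.array([
        [1, 0, 0],   # item 0: 'dress'
        [0, 1, 1],   # item 1: 'denim', 'jacket'
        [0, 1, 0],   # item 2: 'denim'
        [1, 0, 0],   # item 3: 'dress'
    ]))

    # Hybrid features: per-item indicators plus shared tags. The tag columns let
    # the model transfer information across items and give new items a
    # representation before any interactions have been observed.
    hybrid_features = sp.hstack([indicator_features, tag_features]).tocsr()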
   When only metadata features and no indicator variables are present, the model in general does not reduce to a pure content-based system. LightFM estimates feature embeddings by factorising the collaborative interaction matrix; this is unlike content-based systems which (when dimensionality reduction is used) factorise pure content co-occurrence matrices.
   One special case where LightFM does reduce to a pure CB model is where each user is described by an indicator variable and has interacted only with one item. In that setting, the user vector is equivalent to a document vector in the LSI formulation, and only features which occur together in product descriptions will have similar embeddings.
   The fact that LightFM contains both the pure CB model at the sparse data end of the spectrum and the MF model at the dense end suggests that it should adapt well to datasets of varying sparsity. In fact, empirical results show that it performs at least as well as the appropriate specialised model in each scenario.

3.   RELATED WORK
   There are a number of related hybrid models attempting to solve the cold-start problem by jointly modelling content and collaborative data.
   Soboroff et al. [21] represent users as linear combinations of the feature vectors of items they have interacted with. They then perform LSI on the resulting item-feature matrix to obtain latent user profiles. Representations of new items are obtained by projecting them onto the latent feature space. The advantage of the model, relative to pure CB approaches, consists in using collaborative information encoded in the user-feature matrix. However, it models user preferences as being defined over individual features themselves instead of over items (sets of features). This is unlike LightFM, where a feature's effect in predicting an interaction is always taken in the context of all other features characterising a given user-item pair.
   Saveski et al. [18] perform joint factorisation of the user-item and item-feature matrices by using the same item latent feature matrix in both decompositions; the parameters are optimised by minimising a weighted sum of both matrices' reproduction loss functions. A weight hyperparameter governs the relative importance of accuracy in decomposing the collaborative and content matrices. A similar approach is used by McAuley et al. [11] for jointly modelling ratings and product reviews. Here, LightFM has the advantage of simplicity, as its single optimisation objective is to factorise the user-item matrix.
   Shmueli et al. [20] represent items as linear combinations of their features' latent factors to recommend news articles; like LightFM, they use a single-objective approach and minimise the user-item matrix reproduction loss. They show their approach to be successful in a modified cold-start setting, where both metadata and data on other users who have commented on a given article are available. However, their approach does not extend to modelling user features and does not provide evidence on model performance in the warm-start scenario.
   LightFM fits into the hybrid model tradition by jointly factorising the user-item, item-feature, and user-feature matrices. From a theory standpoint, it can be construed as a special case of Factorisation Machines [15].
   FMs provide an efficient method of estimating variable interaction terms in linear models under sparsity. Each variable is represented by a k-dimensional latent factor; the interaction between variables i and j is then given by the dot product of their latent factors. This has the advantage of reducing the number of parameters to be estimated.
   LightFM further restricts the interaction structure by only estimating the interactions between user and item features. This aids the interpretability of the resulting feature embeddings.
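   To make the restriction concrete, here is a small NumPy comparison (an illustrative sketch, not code from either model's implementation): a factorisation machine's second-order term sums interactions over all pairs of active variables, whereas the LightFM-style score only pairs user features with item features.

    import numpy as np
    from itertools import combinations

    def fm_pairwise_term(active, factors):
        """FM second-order term: interactions between all pairs of active variables."""
        return sum(factors[i] @ factors[j] for i, j in combinations(active, 2))

    def lightfm_pairwise_term(user_active, item_active, user_factors, item_factors):
        """LightFM-style term: only user-feature x item-feature interactions."""
        q_u = user_factors[user_active].sum(axis=0)
        p_i = item_factors[item_active].sum(axis=0)
        return q_u @ p_i

    rng = np.random.default_rng(0)
    user_factors = rng.normal(size=(5, 8))   # 5 user features, k = 8
    item_factors = rng.normal(size=(7, 8))   # 7 item features, k = 8

    # In an FM, user and item features live in one shared variable set, so the
    # second-order term also includes user-user and item-item interactions.
    all_factors = np.vstack([user_factors, item_factors])
    fm_active = [0, 2] + [5 + j for j in [1, 4, 6]]
    print(fm_pairwise_term(fm_active, all_factors))
    print(lightfm_pairwise_term([0, 2], [1, 4, 6], user_factors, item_factors))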
4.   DATASETS
   I evaluate LightFM's performance on two datasets. The datasets span the range of dense interaction data, where MF models can be expected to perform well (MovieLens), and sparse data, where CB models tend to perform better (CrossValidated). Both datasets are freely available.

4.1   MovieLens
   The first experiment uses the well-known MovieLens 10M dataset (http://grouplens.org/datasets/movielens/), combined with the Tag Genome tag set [22].
   The dataset consists of approximately 10 million movie ratings, submitted by 71,567 users on 10,681 movies. All movies are described by their genres and a list of tags from the Tag Genome. Each movie-tag pair is accompanied by a relevance score (between 0 and 1), denoting how accurately a given tag describes the movie.
   To binarise the problem, I treat all ratings below 4.0 (out of a 1 to 5 scale) as negative; all ratings equal to or above 4.0 are positive. I also filter out all tags whose relevance score falls below the 0.8 threshold, retaining only highly relevant tags.
   The final dataset contains 69,878 users, 10,681 items, 9,996,948 interactions, and 1030 unique tags.
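   The binarisation and tag-filtering steps can be expressed in a few lines of pandas; the file and column names below are assumptions for illustration rather than the exact ones used in the experiments.

    import pandas as pd

    ratings = pd.read_csv('ratings.csv')          # columns: userId, movieId, rating
    tag_relevance = pd.read_csv('genome.csv')     # columns: movieId, tag, relevance

    # Ratings of 4.0 and above are positive interactions, the rest negative.
    ratings['positive'] = ratings['rating'] >= 4.0

    # Keep only highly relevant movie-tag pairs (relevance score >= 0.8).
    relevant_tags = tag_relevance[tag_relevance['relevance'] >= 0.8]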
4.2   CrossValidated
   The second dataset consists of questions and answers posted on CrossValidated (http://stats.stackexchange.com), a part of the larger network of StackExchange collaborative Q&A sites that focuses on statistics and machine learning. The dataset (https://archive.org/details/stackexchange) consists of 5953 users, 44,200 questions, and 188,865 answers and comments. Each question is accompanied by one or more of 1032 unique tags (such as 'regression' or 'hypothesis-testing').
Additionally, user metadata is available in the form of 'About Me' sections on users' profiles.
   The recommendation goal is to match users with questions they can answer. A user answering a question is taken as an implicit positive signal; all questions that a user has not answered are treated as implicit negative signals. For the training and test sets, I construct 3 negative training pairs for each positive user-question pair by randomly sampling from all questions that a given user has not answered.
   To keep the model simple, I focus on a user's willingness to answer a question rather than their ability, and forego modelling user expertise [17].
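   A simple version of this negative sampling scheme is sketched below (with hypothetical variable names; the experiment code may differ in detail): for every answered question, three unanswered questions are drawn at random for the same user.

    import random

    def sample_negatives(positives, all_questions, ratio=3, seed=42):
        """positives: list of (user, question) pairs; all_questions: set of question ids."""
        rng = random.Random(seed)
        answered = {}
        for user, question in positives:
            answered.setdefault(user, set()).add(question)

        negatives = []
        for user, _ in positives:
            # Questions this user has not answered are candidate negatives.
            candidates = list(all_questions - answered[user])
            negatives.extend((user, q) for q in rng.sample(candidates, ratio))
        return negatives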

5.   EXPERIMENTAL SETUP
   For each dataset, I perform two experiments. The first simulates a warm-start setting: 20% of all interaction pairs are randomly assigned to the test set, but all items and users are represented in the training set. The second is an item cold-start scenario: all interactions pertaining to 20% of items are removed from the training set and added to the test set. This approximates a setting where the recommender is required to make recommendations from a pool of items for which no collaborative information has been gathered, and only content metadata (tags) are available.
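   The two split protocols can be sketched as follows (a simplified illustration with assumed data structures, not the experiment code): the warm-start split samples 20% of interaction triplets, while the item cold-start split holds out every interaction of 20% of the items.

    import numpy as np

    def warm_start_split(interactions, test_fraction=0.2, seed=0):
        """interactions: NumPy array of (user, item, label) rows; random 80/20 split."""
        rng = np.random.default_rng(seed)
        mask = rng.random(len(interactions)) < test_fraction
        return interactions[~mask], interactions[mask]

    def item_cold_start_split(interactions, item_ids, test_fraction=0.2, seed=0):
        """Hold out all interactions involving a random 20% of the items."""
        rng = np.random.default_rng(seed)
        test_items = set(rng.choice(item_ids,
                                    size=int(test_fraction * len(item_ids)),
                                    replace=False))
        in_test = np.array([item in test_items for _, item, _ in interactions])
        return interactions[~in_test], interactions[in_test]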
   I measure model accuracy using the mean receiver operating characteristic area under the curve (ROC AUC) metric. For an individual user, AUC corresponds to the probability that a randomly chosen positive item will be ranked higher than a randomly chosen negative item. A high AUC score is equivalent to a low rank-inversion probability, where the recommender mistakenly ranks an unattractive item higher than an attractive item. I compute this metric for all users in the test set and average it for the final score.
   I compute the AUC metric by repeatedly randomly splitting the dataset into an 80% training set and a 20% test set. The final score is given by averaging across 10 repetitions.
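   The mean per-user AUC can be computed with scikit-learn as in the sketch below; the data structures are assumed for illustration.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def mean_user_auc(test_labels, test_scores):
        """test_labels, test_scores: dicts mapping user id to arrays of binary
        labels and predicted scores over that user's test items."""
        aucs = []
        for user, labels in test_labels.items():
            # AUC is only defined when both classes are present for the user.
            if len(set(labels)) == 2:
                aucs.append(roc_auc_score(labels, test_scores[user]))
        return np.mean(aucs)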
   I test the following models:

  1. MF: a conventional matrix factorisation model with user and item biases and a sigmoid link function [8].

  2. LSI-LR: a content-based model. To estimate it, I first derive latent topics from the item-feature matrix through latent semantic indexing and represent items as linear combinations of latent topics. I then fit a separate logistic regression (LR) model for each user in the topic mixture space. Unlike the LightFM model, which uses collaborative data to produce its latent representation, LSI-LR is purely based on factorising the content matrix. It should therefore be helpful in highlighting the benefit of using collaborative information for constructing feature embeddings.

  3. LSI-UP: a hybrid model that represents user profiles (UP) as linear combinations of items' content vectors, then applies LSI to the resulting matrix to obtain latent user and item representations ([21], see Section 3). I estimate this model by first constructing a user-feature matrix: each row represents a user and is given by the sum of content feature vectors representing the items that user positively interacted with. I then apply truncated SVD to the normalised matrix to obtain user and feature latent vectors; item latent vectors are obtained through projecting them onto the latent feature space. The recommendation score for a user-item pair is then the inner product of their latent representations.

  4. LightFM (tags): the LightFM model using only tag features.

  5. LightFM (tags + ids): the LightFM model using both tag and item indicator features.

  6. LightFM (tags + about): the LightFM model using both item and user features. User features are available only for the CrossValidated dataset. I construct them by converting the 'About Me' sections of users' profiles to a bag-of-words representation. I first strip them of all HTML tags and non-alphabetical characters, then convert the resulting string to lowercase and tokenise on spaces.

In both LightFM (tags) and LightFM (tags + ids) users are described only by indicator features.
   I train the LightFM models using stochastic gradient descent with an initial learning rate of 0.05. The latent dimensionality of the models is set to 64 for all models and experiments. This setting is intended to reflect the balance between model accuracy and the computational cost of larger vectors in production systems (additional results on model sensitivity to this parameter are presented in Section 6.2). I regularise the model through an early-stopping criterion: the training is stopped when the model's performance on the test set stops improving.

6.   EXPERIMENTAL RESULTS

6.1   Recommendation accuracy
   Experimental results are summarised in Table 1. LightFM performs very well, outperforming or matching the specialised model for each scenario.

Table 1: Results (ROC AUC)

                           CrossValidated          MovieLens
                           Warm       Cold         Warm       Cold
LSI-LR                     0.662      0.660        0.686      0.690
LSI-UP                     0.636      0.637        0.687      0.681
MF                         0.541      0.508        0.762      0.500
LightFM (tags)             0.675      0.675        0.744      0.707
LightFM (tags + ids)       0.682      0.674        0.763      0.716
LightFM (tags + about)     0.695      0.696

   In the warm-start, low-sparsity case (warm-start MovieLens), LightFM outperforms MF slightly when using both tag and item indicator features. This suggests that using metadata features may be valuable even when abundant interaction data is present.
   Notably, LightFM (tags) almost matches MF performance despite using only metadata features. The LSI-LR and LSI-UP models using the same information fare much worse. This demonstrates that (1) it is crucial to use collaborative information when estimating content feature embeddings, and (2) LightFM can capture that information much more accurately than other hybrid models such as LSI-UP.
   In the warm-start, high-sparsity case (warm-start CrossValidated), MF performs very poorly. Because user interaction data is sparse (the CrossValidated user-item matrix is 99.95% sparse vs only 99% for the MovieLens dataset), MF is unable to learn good latent representations. Content-based models such as LSI-LR perform much better.
   LightFM variants provide the best performance. LightFM (tags + about) is by far the best model, showing the added advantage of LightFM's ability to integrate user metadata embeddings into the recommendation model. This is likely due to improved prediction performance for users with little data in the training set.
   Results for the cold-start cases are broadly similar. On the CrossValidated dataset, all variants of LightFM outperform other models; LightFM (tags + about) again provides the best performance. Interestingly, LightFM (tags + ids) outperforms LightFM (tags) slightly on the MovieLens dataset, even though no embeddings can be estimated for movies in the test set. This suggests that using both metadata and per-movie features allows the model to estimate better embeddings for both, much like the use of user and item bias terms allows better latent factors to be computed. Unsurprisingly, MF performs no better than random in the cold-start case.
   In all scenarios the LSI-UP model performs no better than the LSI-LR model, despite its attempt to incorporate collaborative data. On the CrossValidated dataset it performs strictly worse. This might be because its latent representations are estimated on less data than in LSI-LR: as there are fewer users than items in the dataset, there are fewer rows in the user-feature matrix than in the item-feature matrix.
   The results confirm that LightFM encompasses both the MF and the LSI-LR model as special cases, performing better than the LSI-LR model in the sparse-data scenario and better than the MF model in the dense-data case. This means not only that a single model can be maintained in either setting, but also that the model will continue to perform well even when the sparsity structure of the data changes.
   Good performance of LightFM (tags) in both datasets is predicated on the availability of high-quality metadata. Nevertheless, it is often possible to obtain good quality metadata from item descriptions (genres, actor lists and so on), expert or community tagging (Pandora [23], StackOverflow), or computer vision systems where image or audio data is available (we use image-based convolutional neural networks for product tagging). In fact, the feature embeddings produced by LightFM can themselves be used to assist the tagging process by suggesting related tags.

6.2   Parameter Sensitivity
   Figure 1 plots the accuracy of LightFM, LSI-LR, and LSI-UP against values of the latent dimensionality hyperparameter d in the cold-start scenario (averaged over 30 runs of each algorithm). As d increases, each model is capable of modelling more complex structures and achieves better performance.
   Interestingly, LightFM performs very well even with a small number of dimensions. In both datasets LightFM consistently outperforms other models, achieving high performance with as few as 16 dimensions. On CrossValidated data, it achieves the same performance as the LSI-LR model for much smaller d: it matches the accuracy of the 512-dimensional LSI-LR model even when using fewer than 32 dimensions.
   This is an important win for large-scale recommender systems, where the choice of d is governed by a trade-off between vector size and recommendation accuracy. Since smaller vectors occupy less memory and require fewer computations at query time, better representational power at small d allows the system to achieve the same model performance at a smaller computational cost.

6.3   Tag embeddings
   Feature embeddings generated by the LightFM model capture important information about the semantic relationships between different features. Table 2 gives some examples by listing groups of tags similar (in the cosine similarity sense) to a given query tag.

Table 2: Tag similarity

Query tag       Similar tags
'regression'    'least squares', 'multiple regression', 'regression coefficients', 'multicollinearity'
'MCMC'          'BUGS', 'Metropolis-Hastings', 'Beta-Binomial', 'Gibbs', 'Bayesian'
'survival'      'epidemiology', 'Cox model', 'Kaplan-Meier', 'hazard'
'art house'     'pretentious', 'boring', 'graphic novel', 'pointless', 'weird'
'dystopia'      'post-apocalyptic', 'futuristic', 'artificial intelligence'
'bond'          '007', 'secret service', 'nuclear bomb', 'spying', 'assassin'

   In this respect, LightFM is similar to recent word embedding approaches like word2vec and GloVe [12, 13]. This is perhaps unsurprising, given that word embedding techniques are closely related to forms of matrix factorisation [9]. Nevertheless, LightFM and word embeddings differ in one important respect: whilst word2vec and GloVe embeddings are driven by textual corpus co-occurrence statistics, LightFM is based on user interaction data.
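   The similar-tag lists in Table 2 are the kind of output a simple cosine-similarity query over the learned tag embeddings would produce; a minimal sketch (with assumed variable names) is:

    import numpy as np

    def most_similar(query_tag, tag_labels, tag_embeddings, topn=5):
        """Return the tags whose embeddings have the highest cosine
        similarity to the query tag's embedding."""
        vectors = tag_embeddings / np.linalg.norm(tag_embeddings, axis=1, keepdims=True)
        query = vectors[tag_labels.index(query_tag)]
        similarity = vectors @ query
        ranked = np.argsort(-similarity)
        return [tag_labels[i] for i in ranked if tag_labels[i] != query_tag][:topn]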
   LightFM embeddings are useful for a number of recommendation tasks.

  1. Tag recommendation. Various applications use collaborative tagging as a way of generating richer metadata for use in search and recommender systems [2, 7]. A tag recommender can enhance this process by either automatically applying matching tags, or generating suggested tag lists for approval by users. LightFM-produced tag embeddings will work well for this task without the need to build a separate specialised model for tag recommendations.

  2. Genre or category recommendation. Many domains are characterised by an ontology of genres or categories which play an important role in the presentation of recommendations. For example, the Netflix interface is organised in genre rows; for Lyst, fashion designers, categories and subcategories are fundamental. The degree of similarity between the embeddings of genres or categories provides a ready basis for genre or category recommendations that respect the semantic structure of the ontology.
[Figure 1: Latent dimension sensitivity. ROC AUC against the latent dimensionality d (4 to 512) for LSI-LR, LSI-UP, LightFM (tags), LightFM (tags + ids), and LightFM (tags + about). (a) CrossValidated; (b) MovieLens.]
  3. Recommendation justification. Rich information encoded in feature embeddings can help provide explanations for recommendations made by the system. For example, we might recommend a ball gown to a user who likes pencil skirts, and justify it by the two features' similarity as revealed by the distance between their latent factors.

7.   USAGE IN PRODUCTION SYSTEMS
   The LightFM approach is motivated by our experience at Lyst. We have deployed LightFM in production, and successfully use it for a number of recommendation tasks. In this section, I describe some of the engineering and algorithm choices that make this possible.

7.1   Model training and fold-in
   Thousands of new items and users appear on Lyst every day. To cope with this, we train our LightFM model in an online manner, continually updating the representations of existing features and creating fresh representations for features that we have never observed before.
   We store model state, including feature embeddings and accumulated squared gradient information, in a database. When new data on user interactions arrives, we restore the model state and resume training, folding in any newly observed features. Since our implementation uses per-parameter diminishing learning rates (Adagrad), any updates of established features will be incremental as the model adapts to new data. For new features, a high learning rate is used to allow useful embeddings to be learned as quickly as possible.
   No re-training is necessary for folding in new products: their representation can be immediately computed as the sum of the representations of their features.
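   As a sketch of this fold-in step (illustrative names only, not the production code): a new product's embedding is just the sum of its features' embeddings, so it can be scored against users the moment its metadata is known.

    import numpy as np

    def fold_in_new_item(feature_ids, item_feature_embeddings, item_feature_biases):
        """Representation of a product never seen in training, computed from the
        already-learned embeddings of its metadata features."""
        p_new = item_feature_embeddings[feature_ids].sum(axis=0)
        b_new = item_feature_biases[feature_ids].sum()
        return p_new, b_new

    # Example: a new item tagged with (hypothetical) feature ids 3, 17 and 42;
    # p_new and b_new can be used directly in the scoring rule of Equation (1).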
7.2   Feature engineering
   Each of our products is described by a set of textual features as well as structured metadata such as its type (dress, shoes and so on) or designer. These are accompanied by additional features coming from two sources.
   Firstly, we employ a team of experienced fashion moderators, helping us to derive more fine-grained features such as clothing categories and subcategories (peplum dress, halterneck and so on).
   Secondly, we use machine learning systems for automatic feature detection. The most important of these is a set of deep convolutional neural networks deriving feature tags from product image data.

7.3   Approximate nearest neighbour searches
   The biggest application of LightFM-derived item representations is related product recommendations: given a product, we would like to recommend other highly relevant products. To do this efficiently across 8 million products, we use a combination of approximate (for on-demand recommendations) and exact (for near-line computation) nearest neighbour search.
   For approximate nearest neighbour (ANN) queries, we use Random Projection (RP) trees [4, 5]. RP trees are a variant of random-projection [3] based locality sensitive hashing (LSH).
   In LSH, a k-bit hash code for each point x is generated by drawing k random hyperplanes v_1, ..., v_k, and then setting the i-th bit of the hash code to 1 if x · v_i ≥ 0 and 0 otherwise. The approximate nearest neighbours of x are then other points that share the same hash code (or whose hash codes are within some small Hamming distance of each other).
   While extremely fast, LSH has the undesirable property of sometimes producing a very unbalanced distribution of points across hash codes: if points are densely concentrated, many codes of the tree will apply to no products
while some will describe a very large number of points. This is unacceptable when building a production system, as it will lead to many queries being very slow.
   RP trees provide much better guarantees about the size of leaf nodes: at each internal node, points are split based on the median distance to the chosen random hyperplane. This guarantees that at every split approximately half the points will be allocated to each leaf, making the distribution of points (and query performance) much more predictable.
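   For reference, the sign-bit hashing step that both plain LSH and RP trees build on can be sketched in a few lines of NumPy (an illustrative sketch, not the production implementation):

    import numpy as np

    def lsh_hash_codes(points, k=16, seed=0):
        """k-bit hash codes: the i-th bit is 1 if the point lies on the
        positive side of the i-th random hyperplane."""
        rng = np.random.default_rng(seed)
        hyperplanes = rng.normal(size=(points.shape[1], k))
        bits = (points @ hyperplanes) >= 0
        # Pack the k bits of each row into a single integer code.
        return bits @ (1 << np.arange(k))

    # Candidate neighbours of a query are the points whose code matches (or
    # differs in only a few bits from) the query's code. RP trees instead split
    # on the median projection at each node, keeping the leaves roughly balanced.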
                                                                      recommendations in folksonomies. In Knowledge
8.     CONCLUSIONS AND FUTURE WORK                                    Discovery in Databases: PKDD 2007, pages 506–514.
 In this paper, I have presented an effective hybrid recom-           Springer, 2007.
mender model dubbed LightFM. I have shown the following:          [8] Y. Koren, R. Bell, and C. Volinsky. Matrix
                                                                      factorization techniques for recommender systems.
     1. LightFM performs at least as well as a specialised            Computer, (8):30–37, 2009.
        model across a wide range of collaborative data spar-     [9] O. Levy and Y. Goldberg. Neural word embedding as
        sity scenarios. It outperforms existing content-based         implicit matrix factorization. In Advances in Neural
        and hybrid models in cold-start scenarios where col-          Information Processing Systems, pages 2177–2185,
        laborative data is abundant or where user metadata is         2014.
        available.                                               [10] P. Lops, M. De Gemmis, and G. Semeraro.
                                                                      Content-based recommender systems: State of the art
     2. It produces high-quality content feature embeddings
                                                                      and trends. In Recommender systems handbook, pages
        that capture important semantic information about
                                                                      73–105. Springer, 2011.
        the problem domain, and can be used for related tasks
        such as tag recommendations.                             [11] J. McAuley and J. Leskovec. Hidden factors and
                                                                      hidden topics: understanding rating dimensions with
Both properties make LightFM an attractive model, appli-              review text. In Proceedings of the 7th ACM conference
cable both in cold- and warm-start settings. Nevertheless,            on Recommender systems, pages 165–172. ACM, 2013.
I see two promising directions in extending the current ap-      [12] T. Mikolov, K. Chen, G. Corrado, and J. Dean.
proach.                                                               Efficient estimation of word representations in vector
   Firstly, the model can be easily extended to use more so-          space. arXiv preprint arXiv:1301.3781, 2013.
phisticated training methodologies. For example, an optimi-      [13] J. Pennington, R. Socher, and C. D. Manning. Glove:
sation scheme using Weighted Approximate-Rank Pairwise                Global vectors for word representation. Proceedings of
loss [24] or directly optimising mean reciprocal rank could           the Empiricial Methods in Natural Language
be used [19].                                                         Processing (EMNLP 2014), 12, 2014.
   Secondly, there is no easy way of incorporating visual or     [14] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A
audio features in the present formulation of LightFM. At              lock-free approach to parallelizing stochastic gradient
Lyst, we use a two-step process to address this: we first             descent. In Advances in Neural Information Processing
use convolutional neural networks (CNNs) on image data                Systems, pages 693–701, 2011.
to generate binary tags for all products, and then use the       [15] S. Rendle. Factorization machines. In Data Mining
tags for generating recommendations. We conjecture that               (ICDM), 2010 IEEE 10th International Conference
substantial improvements could be realised if the CNNs were           on, pages 995–1000. IEEE, 2010.
trained with recommendation loss directly.                       [16] S. Rendle, C. Freudenthaler, Z. Gantner, and
                                                                      L. Schmidt-Thieme. BPR: Bayesian personalized
9.   REFERENCES
 [1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749, 2005.
 [2] M. Bastian, M. Hayes, W. Vaughan, S. Shah, P. Skomoroch, H. Kim, S. Uryasev, and C. Lloyd. Linkedin skills: large-scale topic extraction and inference. In Proceedings of the 8th ACM Conference on Recommender systems, pages 1–8. ACM, 2014.
 [3] S. Dasgupta. Experiments with random projection. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence, pages 143–151. Morgan Kaufmann Publishers Inc., 2000.
 [4] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 537–546. ACM, 2008.
 [5] S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. arXiv preprint arXiv:1302.1948, 2013.
 [6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.
 [7] R. Jäschke, L. Marinho, A. Hotho, L. Schmidt-Thieme, and G. Stumme. Tag recommendations in folksonomies. In Knowledge Discovery in Databases: PKDD 2007, pages 506–514. Springer, 2007.
 [8] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.
 [9] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
[10] P. Lops, M. De Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender systems handbook, pages 73–105. Springer, 2011.
[11] J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165–172. ACM, 2013.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[13] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.
[14] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[15] S. Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
[16] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
[17] J. San Pedro and A. Karatzoglou. Question recommendation for collaborative question answering systems with RankSLDA. In Proceedings of the 8th ACM Conference on Recommender systems, pages 193–200. ACM, 2014.
[18] M. Saveski and A. Mantrach. Item cold-start recommendations: learning local collective embeddings. In Proceedings of the 8th ACM Conference on Recommender systems, pages 89–96. ACM, 2014.
[19] Y. Shi, A. Karatzoglou, L. Baltrunas, M. Larson, N. Oliver, and A. Hanjalic. CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering. In Proceedings of the 6th ACM Conference on Recommender systems, pages 139–146. ACM, 2012.
[20] E. Shmueli, A. Kagian, Y. Koren, and R. Lempel.
     Care to Comment?: Recommendations for
     commenting on news stories. In Proceedings of the 21st
     international conference on World Wide Web, pages
     429–438. ACM, 2012.
[21] I. Soboroff and C. Nicholas. Combining content and
     collaboration in text filtering. In Proceedings of the
     IJCAI, volume 99, pages 86–91, 1999.
[22] J. Vig, S. Sen, and J. Riedl. The tag genome:
     Encoding community knowledge to support novel
     interaction. ACM Transactions on Interactive
     Intelligent Systems (TiiS), 2, 2012.
[23] T. Westergren. The music genome project. Online:
     http://pandora.com/mgp, 2007.
[24] J. Weston, S. Bengio, and N. Usunier. WSABIE:
     Scaling up to large vocabulary image annotation. In
     IJCAI, volume 11, pages 2764–2770, 2011.