<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Metadata Embeddings for User and Item Cold-start Recommendations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maciej Kula</string-name>
          <aff>Lyst</aff>
          <email>maciej.kula@lyst.com</email>
        </contrib>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>I present a hybrid matrix factorisation model representing users and items as linear combinations of their content features' latent factors. The model outperforms both collaborative and content-based models in cold-start or sparse interaction data scenarios (using both user and item metadata), and performs at least as well as a pure collaborative matrix factorisation model where interaction data is abundant. Additionally, feature embeddings produced by the model encode semantic information in a way reminiscent of word embedding approaches, making them useful for a range of related tasks such as tag recommendations.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Cold-start</kwd>
        <kwd>Matrix Factorization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Building recommender systems that perform well in
cold-start scenarios (where little data is available on new users
and items) remains a challenge. The standard matrix
factorisation (MF) model performs poorly in that setting: it is
difficult to effectively estimate user and item latent factors
when collaborative interaction data is sparse.</p>
      <p>
        Content-based (CB) methods address this by representing
items through their metadata [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As these are known in
advance, recommendations can be computed even for new
items for which no collaborative data has been gathered.
Unfortunately, no transfer learning occurs in CB models:
models for each user are estimated in isolation and do not
benefit from data on other users. Consequently, CB models
perform worse than MF models where collaborative
information is available and require a large amount of data on
each user, rendering them unsuitable for user cold-start [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>CBRecSys 2015, September 20, 2015, Vienna, Austria. Copyright remains with the authors and/or original copyright holders.</p>
      <p>At Lyst, solving these problems is crucial. We are a
fashion company aiming to provide our users with a convenient
and engaging way to browse and shop for fashion online.
To that end we maintain a very large product catalogue:
at the time of writing, we aggregate over 8 million fashion
items from across the web, adding tens of thousands of new
products every day.</p>
      <p>Three factors conspire to make recommendations
challenging for us. Firstly, our system contains a very large
number of items. This makes our data very sparse.
Secondly, we deal in fashion: often, the most relevant items
are those from newly released collections, allowing us only
a short window to gather data and provide effective
recommendations. Finally, a large proportion of our users are
first-time visitors: we would like to present them with compelling
recommendations even with little data. This combination of
user and item cold-start makes both pure collaborative and
content-based methods unsuitable for us.</p>
      <p>To solve this problem, I use a hybrid content-collaborative
model, called LightFM due to its resemblance to
factorisation machines (see Section 3). In LightFM, like in a
collaborative filtering model, users and items are represented
as latent vectors (embeddings). However, just as in a CB
model, these are entirely defined by functions (in this case,
linear combinations) of embeddings of the content features
that describe each product or user. For example, if the movie
'Wizard of Oz' is described by the following features:
'musical fantasy', 'Judy Garland', and 'Wizard of Oz', then its
latent representation will be given by the sum of these
features' latent representations.</p>
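<p>As an illustration, this summation can be sketched in a few lines of Python (the embedding values and dimensionality below are made up for the example; they are not taken from any trained model):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative embedding dimensionality

# Hypothetical latent vectors for each content feature.
feature_embeddings = {
    "musical fantasy": rng.normal(size=d),
    "Judy Garland": rng.normal(size=d),
    "Wizard of Oz": rng.normal(size=d),
}

def item_representation(features):
    """An item's latent vector is the sum of its features' latent vectors."""
    return np.sum([feature_embeddings[f] for f in features], axis=0)

movie = item_representation(["musical fantasy", "Judy Garland", "Wizard of Oz"])
```

<p>The same construction applies to users: a user's vector is the sum of the embeddings of the features that describe them.</p>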
      <p>In doing so, LightFM unites the advantages of
content-based and collaborative recommenders. In this paper, I
formalise the model and present empirical results on two
datasets, showing that:
1. In both cold-start and low density scenarios, LightFM
performs at least as well as pure content-based models,
substantially outperforming them when either (1)
collaborative information is available in the training set
or (2) user features are included in the model.
2. When collaborative data is abundant (warm-start, dense
user-item matrix), LightFM performs at least as well
as the MF model.
3. Embeddings produced by LightFM encode important
semantic information about features, and can be used
for related recommendation tasks such as tag
recommendations.</p>
      <p>This has several bene ts for real-world recommender
systems. Because LightFM works well on both dense and sparse
data, it obviates the need for building and maintaining
multiple specialised machine learning models for each setting.
Additionally, as it can use both user and item metadata, it
has the quality of being applicable in both item and user
cold-start scenarios.</p>
      <p>To allow others to reproduce the results in this paper, I
have released a Python implementation of LightFM1, and
made the source code for this paper and all the experiments
available on Github2.
</p>
    </sec>
    <sec id="sec-2">
      <title>2. LIGHTFM</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Motivation</title>
      <p>The structure of the LightFM model is motivated by two
considerations.</p>
      <p>1. The model must be able to learn user and item
representations from interaction data: if items described as
'ball gown' and 'pencil skirt' are consistently all liked
by users, the model must learn that ball gowns are
similar to pencil skirts.
2. The model must be able to compute recommendations
for new items and users.</p>
      <p>I ful l the rst requirement by using the latent
representation approach. If ball gowns and pencil skirts are both liked
by the same users, their embeddings will be close together;
if ball gowns and biker jackets are never liked by the same
users, their embeddings will be far apart.</p>
      <p>Such representations allow transfer learning to occur. If
the representations for ball gowns and pencil skirts are
similar, we can confidently recommend ball gowns to a new user
who has so far only interacted with pencil skirts.</p>
      <p>This is over and above what pure CB models using
dimensionality reduction techniques (such as latent semantic
indexing, LSI) can achieve, as these only encode information
given by feature co-occurrence rather than user actions. For
example, suppose that all users who look at items described
as aviators also look at items described as wayfarers, but
the two features never describe the same item. In this case,
the LSI vector for wayfarers will not be similar to the one
for aviators even though collaborative information suggests
it should be.</p>
      <p>I ful l the second requirement by representing items and
users as linear combinations of their content features.
Because content features are known the moment a user or item
enters the system, this allows recommendations to be made
straight away. The resulting structure is also easy to
understand. The representation for denim jacket is simply a
sum of the representation of denim and the representation
of jacket; the representation for a female user from the US
is a sum of the representations of US and female users.
</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 The Model</title>
      <p>To describe the model formally, let U be the set of users,
I be the set of items, F^U be the set of user features, and F^I
the set of item features. Each user interacts with a number
of items, either in a favourable way (a positive interaction),</p>
      <sec id="sec-4-1">
        <title>Footnotes: 1. https://github.com/lyst/lightfm/ 2. https://github.com/lyst/lightfm-paper/</title>
        <p>or in an unfavourable way (a negative interaction). The set
of all user-item interaction pairs (u, i) ∈ U × I is the union
of the positive interactions S^+ and the negative interactions S^−.</p>
        <p>Users and items are fully described by their features. Each
user u is described by a set of features f_u ⊂ F^U. The same
holds for each item i, whose features are given by f_i ⊂ F^I.
The features are known in advance and represent user and
item metadata.</p>
        <p>The model is parameterised in terms of d-dimensional user
and item feature embeddings e_f^U and e_f^I for each feature f.
Each feature is also described by a scalar bias term (b_f^U for
user and b_f^I for item features).</p>
        <p>The latent representation of user u is given by the sum of
its features' latent vectors:</p>
        <sec id="sec-4-1-2">
          <p>q_u = Σ_{j ∈ f_u} e_j^U</p>
          <p>The same holds for item i: p_i = Σ_{j ∈ f_i} e_j^I. The
bias term for user u is given by the sum of the features'
biases, b_u = Σ_{j ∈ f_u} b_j^U, and the same holds for item i:
b_i = Σ_{j ∈ f_i} b_j^I.</p>
          <p>
            The model's prediction for user u and item i is then given
by the dot product of the user and item representations,
adjusted by the user and item feature biases:
r̂_ui = f(q_u · p_i + b_u + b_i)   (1)
There are a number of functions suitable for f(·). An identity
function would work well for predicting ratings; in this
paper, I am interested in predicting binary data, and so after
Rendle et al. [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] I choose the sigmoid function
f(x) = 1 / (1 + exp(−x)).
          </p>
          <p>The optimisation objective for the model consists in
maximising the likelihood of the data conditional on the
parameters. The likelihood is given by</p>
          <p>L(e^U, e^I, b^U, b^I) = Π_{(u,i) ∈ S^+} r̂_ui × Π_{(u,i) ∈ S^−} (1 − r̂_ui)   (2)</p>
          <p>
            I train the model using asynchronous stochastic gradient
descent [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. I use four training threads for experiments
performed in this paper. The per-parameter learning rate
schedule is given by Adagrad [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
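<p>A single-threaded sketch of the scoring function and one Adagrad step on the log-likelihood follows. The function names, toy shapes, and the synchronous update are illustrative simplifications of the asynchronous released implementation, and bias updates are omitted for brevity:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(user_feats, item_feats, eU, eI, bU, bI):
    """r̂_ui = f(q_u · p_i + b_u + b_i), with the representations
    summed over the user's and the item's feature sets."""
    q_u = eU[user_feats].sum(axis=0)
    p_i = eI[item_feats].sum(axis=0)
    return sigmoid(q_u @ p_i + bU[user_feats].sum() + bI[item_feats].sum())

def adagrad_step(user_feats, item_feats, y, eU, eI, bU, bI, gU, gI,
                 lr=0.05, eps=1e-6):
    """One stochastic step on a single interaction (y = 1 positive,
    y = 0 negative); gU, gI accumulate squared gradients."""
    q_u = eU[user_feats].sum(axis=0)
    p_i = eI[item_feats].sum(axis=0)
    err = predict(user_feats, item_feats, eU, eI, bU, bI) - y
    grad_u, grad_i = err * p_i, err * q_u  # d(-log-lik)/d(embeddings)
    gU[user_feats] += grad_u ** 2
    gI[item_feats] += grad_i ** 2
    eU[user_feats] -= lr * grad_u / np.sqrt(gU[user_feats] + eps)
    eI[item_feats] -= lr * grad_i / np.sqrt(gI[item_feats] + eps)
```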
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2.3 Relationship to Other Models</title>
      <p>The relationship between LightFM and the collaborative
MF model is governed by the structure of the user and item
feature sets. If the feature sets consist solely of indicator
variables for each user and item, LightFM reduces to the
standard MF model. If the feature sets also contain
metadata features shared by more than one item or user, LightFM
extends the MF model by letting the feature latent factors
explain part of the structure of user interactions.</p>
      <p>This is important on three counts.
1. In most applications there will be fewer metadata
features than there are users or items, either because
an ontology with a fixed type/category structure is
used, or because a fixed-size dictionary of most
common terms is maintained when using raw textual
features. This means that fewer parameters need to be
estimated from limited training data, reducing the risk of
overfitting and improving generalisation performance.
2. Latent vectors for indicator variables cannot be
estimated for new, cold-start users or items. Representing
these as combinations of metadata features that can
be estimated from the training set makes it possible to
make cold-start predictions.
3. If only indicator features are present, LightFM should
perform on par with the standard MF model.</p>
      <p>When only metadata features and no indicator variables
are present, the model in general does not reduce to a pure
content-based system. LightFM estimates feature
embeddings by factorising the collaborative interaction matrix; this
is unlike content-based systems which (when dimensionality
reduction is used) factorise pure content co-occurrence
matrices.</p>
      <p>One special case where LightFM does reduce to a pure
CB model is where each user is described by an indicator
variable and has interacted only with one item. In that
setting, the user vector is equivalent to a document vector in
the LSI formulation, and only features which occur together
in product descriptions will have similar embeddings.</p>
      <p>The fact that LightFM contains both the pure CB model
at the sparse data end of the spectrum and the MF model at
the dense end suggests that it should adapt well to datasets
of varying sparsity. In fact, empirical results show that it
performs at least as well as the appropriate specialised model
in each scenario.</p>
    </sec>
    <sec id="sec-6">
      <title>RELATED WORK</title>
      <p>There are a number of related hybrid models attempting
to solve the cold-start problem by jointly modelling content
and collaborative data.</p>
      <p>
        Soboroff et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] represent users as linear combinations
of the feature vectors of items they have interacted with.
They then perform LSI on the resulting item-feature
matrix to obtain latent user pro les. Representations of new
items are obtained by projecting them onto the latent
feature space. The advantage of the model, relative to pure
CB approaches, consists in using collaborative information
encoded in the user-feature matrix. However, it models user
preferences as being defined over individual features
themselves instead of over items (sets of features). This is unlike
LightFM, where a feature's effect in predicting an
interaction is always taken in the context of all other features
characterising a given user-item pair.
      </p>
      <p>
        Saveski et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] perform joint factorisation of the user-item
and item-feature matrices by using the same item latent
feature matrix in both decompositions; the parameters are
optimised by minimising a weighted sum of both matrices'
reproduction loss functions. A weight hyperparameter
governs the relative importance of accuracy in decomposing the
collaborative and content matrices. A similar approach is
used by McAuley et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for jointly modelling ratings
and product reviews. Here, LightFM has the advantage of
simplicity as its single optimisation objective is to factorise
the user-item matrix.
      </p>
      <p>
        Shmueli et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] represent items as linear combinations
of their features' latent factors to recommend news articles;
like LightFM, they use a single-objective approach and
minimise the user-item matrix reproduction loss. They show
their approach to be successful in a modified cold-start
setting, where both metadata and data on other users who have
commented on a given article are available. However, their
approach does not extend to modelling user features and does
not provide evidence on model performance in the warm-start
scenario.
      </p>
      <p>
        LightFM fits into the hybrid model tradition by jointly
factorising the user-item, item-feature, and user-feature
matrices. From a theory standpoint, it can be construed as a
special case of Factorisation Machines [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>FMs provide an efficient method of estimating variable
interaction terms in linear models under sparsity. Each
variable is represented by a k-dimensional latent factor; the
interaction between variables i and j is then given by the dot
product of their latent factors. This has the advantage of
reducing the number of parameters to be estimated.</p>
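<p>The pairwise interaction term of an FM can be sketched as follows, using the standard O(kn) reformulation of the pairwise sum (this is a generic FM sketch, not code from the LightFM implementation):</p>

```python
import numpy as np

def fm_interactions(x, V):
    """Sum of pairwise terms sum_{i<j} (v_i . v_j) x_i x_j, computed as
    0.5 * sum_k [(sum_i v_ik x_i)^2 - sum_i v_ik^2 x_i^2]."""
    linear = (V * x[:, None]).sum(axis=0)                # sum_i v_i x_i, shape (k,)
    squares = ((V ** 2) * (x ** 2)[:, None]).sum(axis=0)
    return 0.5 * float((linear ** 2 - squares).sum())
```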
      <p>LightFM further restricts the interaction structure by only
estimating the interactions between user and item features.
This aids the interpretability of resulting feature
embeddings.
</p>
    </sec>
    <sec id="sec-7">
      <title>4. DATASETS</title>
      <p>I evaluate LightFM's performance on two datasets. The
datasets span the range of dense interaction data, where
MF models can be expected to perform well (MovieLens),
and sparse data, where CB models tend to perform better
(CrossValidated). Both datasets are freely available.
</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 MovieLens</title>
      <p>
        The first experiment uses the well-known MovieLens 10M
dataset3, combined with the Tag Genome tag set [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>The dataset consists of approximately 10 million movie
ratings, submitted by 71,567 users on 10,681 movies. All
movies are described by their genres and a list of tags from
the Tag Genome. Each movie-tag pair is accompanied by a
relevance score (between 0 and 1), denoting how accurately
a given tag describes the movie.</p>
      <p>To binarise the problem, I treat all ratings below 4.0 (on
a 1 to 5 scale) as negative; all ratings equal to or above 4.0
are positive. I also filter out all movie-tag pairs whose relevance
score falls below the 0.8 threshold to retain only highly relevant tags.</p>
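<p>This preprocessing can be sketched as follows, assuming the ratings and tag relevances arrive as plain tuples (the tuple layout is an assumption made for the example):</p>

```python
def binarise_ratings(ratings, threshold=4.0):
    """Split (user, movie, rating) triplets into positive and
    negative interactions at the 4.0 threshold."""
    positives = [(u, m) for u, m, r in ratings if r >= threshold]
    negatives = [(u, m) for u, m, r in ratings if r < threshold]
    return positives, negatives

def filter_tags(movie_tags, min_relevance=0.8):
    """Keep only highly relevant (movie, tag, relevance) triplets."""
    return [(m, t) for m, t, rel in movie_tags if rel >= min_relevance]
```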
      <p>The final dataset contains 69,878 users, 10,681 items,
9,996,948 interactions, and 1030 unique tags.</p>
    </sec>
    <sec id="sec-9">
      <title>4.2 CrossValidated</title>
      <p>The second dataset consists of questions and answers posted
on CrossValidated4, a part of the larger network of
StackExchange collaborative Q&amp;A sites that focuses on statistics
and machine learning. The dataset5 consists of 5953 users,
44,200 questions, and 188,865 answers and comments. Each
question is accompanied by one or more of 1032 unique tags
(such as `regression' or `hypothesis-testing'). Additionally,</p>
      <sec id="sec-9-1">
        <title>Footnotes: 3. http://grouplens.org/datasets/movielens/ 4. http://stats.stackexchange.com 5. https://archive.org/details/stackexchange</title>
        <p>user metadata is available in the form of 'About Me' sections
on users' profiles.</p>
        <p>The recommendation goal is to match users with questions
they can answer. A user answering a question is taken as
an implicit positive signal; all questions that a user has not
answered are treated as implicit negative signals. For the
training and test sets, I construct 3 negative training pairs
for each positive user-question pair by randomly sampling
from all questions that a given user has not answered.</p>
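<p>The negative sampling scheme can be sketched as follows (the data layout and function name are illustrative assumptions):</p>

```python
import random

def sample_negatives(positives, all_questions, k=3, seed=42):
    """For each positive (user, question) pair, sample k questions the
    user has not answered and treat them as implicit negatives."""
    rng = random.Random(seed)
    negatives = []
    for user, _ in positives:
        answered = {q for u, q in positives if u == user}
        candidates = [q for q in all_questions if q not in answered]
        negatives.extend((user, q) for q in rng.sample(candidates, k))
    return negatives
```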
        <p>
          To keep the model simple, I focus on a user's willingness
to answer a question rather than their ability, and forego
modelling user expertise [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5. EXPERIMENTAL SETUP</title>
      <p>For each dataset, I perform two experiments. The first
simulates a warm-start setting: 20% of all interaction pairs
are randomly assigned to the test set, but all items and
users are represented in the training set. The second is an
item cold-start scenario: all interactions pertaining to 20%
of items are removed from the training set and added to
the test set. This approximates a setting where the
recommender is required to make recommendations from a pool of
items for which no collaborative information has been
gathered, and only content metadata (tags) are available.</p>
      <p>I measure model accuracy using the mean receiver
operating characteristic area under the curve (ROC AUC) metric.
For an individual user, AUC corresponds to the probability
that a randomly chosen positive item will be ranked higher
than a randomly chosen negative item. A high AUC score
is equivalent to low rank-inversion probability, where the
recommender mistakenly ranks an unattractive item higher
than an attractive item. I compute this metric for all users
in the test set and average it for the final score.</p>
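<p>Per-user AUC can be computed directly from its probabilistic definition; a naive O(|positives| × |negatives|) sketch:</p>

```python
import itertools

def user_auc(scores, positives, negatives):
    """Probability that a randomly chosen positive item is scored
    above a randomly chosen negative one (ties count as half)."""
    pairs = list(itertools.product(positives, negatives))
    wins = sum(1.0 if scores[p] > scores[n] else 0.5 if scores[p] == scores[n] else 0.0
               for p, n in pairs)
    return wins / len(pairs)
```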
      <p>I compute the AUC metric by repeatedly randomly
splitting the dataset into an 80% training set and a 20% test set.
The final score is given by averaging across 10 repetitions.</p>
      <p>
        I test the following models:
1. MF: a conventional matrix factorisation model with
user and item biases and a sigmoid link function [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
2. LSI-LR: a content-based model. To estimate it, I
first derive latent topics from the item-feature matrix
through latent semantic indexing and represent items
as linear combinations of latent topics. I then fit a
separate logistic regression (LR) model for each user
in the topic mixture space. Unlike the LightFM model,
which uses collaborative data to produce its latent
representation, LSI-LR is purely based on factorising the
content matrix. It should therefore be helpful in
highlighting the benefit of using collaborative information
for constructing feature embeddings.
3. LSI-UP: a hybrid model that represents user profiles
(UP) as linear combinations of items' content vectors,
then applies LSI to the resulting matrix to obtain
latent user and item representations ([
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], see Section
3). I estimate this model by first constructing a
user-feature matrix: each row represents a user and is given
by the sum of content feature vectors representing the
items that user positively interacted with. I then
apply truncated SVD to the normalised matrix to obtain
user and feature latent vectors; item latent vectors are
obtained by projecting items onto the latent feature space.
4. LightFM (tags): the LightFM model using tag
features only.
5. LightFM (tags + ids): the LightFM model using
both tag and item indicator features.
6. LightFM (tags + about): the LightFM model using
both item and user features. User features are
available only for the CrossValidated dataset. I construct
them by converting the 'About Me' sections of users'
profiles to a bag-of-words representation. I first strip
them of all HTML tags and non-alphabetical
characters, then convert the resulting string to lowercase and
tokenise on spaces.
      </p>
      <p>In both LightFM (tags) and LightFM (tags + ids) users are
described only by indicator features.</p>
      <p>I train the LightFM models using stochastic gradient
descent with an initial learning rate of 0.05. The latent
dimensionality of the models is set to 64 for all models and
experiments. This setting is intended to reflect the balance
between model accuracy and the computational cost of larger
vectors in production systems (additional results on model
sensitivity to this parameter are presented in Section 6.2).
I regularise the model through an early-stopping criterion:
the training is stopped when the model's performance on
the test set stops improving.
</p>
    </sec>
    <sec id="sec-11">
      <title>6. EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-12">
      <title>6.1 Recommendation accuracy</title>
      <p>Experimental results are summarised in Table 1. LightFM
performs very well, outperforming or matching the specialised
model for each scenario.</p>
      <p>In the warm-start, low-sparsity case (warm-start
MovieLens), LightFM outperforms MF slightly when using both
tag and item indicator features. This suggests that using
metadata features may be valuable even when abundant
interaction data is present.</p>
      <p>Notably, LightFM (tags) almost matches MF performance
despite using only metadata features. The LSI-LR and
LSI-UP models using the same information fare much worse.
This demonstrates that (1) it is crucial to use collaborative
information when estimating content feature embeddings,
and (2) LightFM can capture that information much more
accurately than other hybrid models such as LSI-UP.</p>
      <p>In the warm-start, high-sparsity case (warm-start
CrossValidated), MF performs very poorly. Because user
interaction data is sparse (the CrossValidated user-item matrix is
99.95% sparse vs only 99% for the MovieLens dataset), MF is
unable to learn good latent representations. Content-based
models such as LSI-LR perform much better.</p>
      <p>LightFM variants provide the best performance. LightFM
(tags + about) is by far the best model, showing the added
advantage of LightFM's ability to integrate user metadata
embeddings into the recommendation model. This is likely
due to improved prediction performance for users with little
data in the training set.</p>
      <p>Results for the cold-start cases are broadly similar. On
the CrossValidated dataset, all variants of LightFM
outperform other models; LightFM (tags + about) again provides
the best performance. Interestingly, LightFM (tags +
ids) outperforms LightFM (tags) slightly on the
MovieLens dataset, even though no embeddings can be estimated
for movies in the test set. This suggests that using both
metadata and per-movie features allows the model to
estimate better embeddings for both, much like the use of user
and item bias terms allows better latent factors to be
computed. Unsurprisingly, MF performs no better than random
in the cold-start case.</p>
      <p>In all scenarios the LSI-UP model performs no better than
the LSI-LR model, despite its attempt to incorporate
collaborative data. On the CrossValidated dataset it performs
strictly worse. This might be because its latent
representations are estimated on less data than in LSI-LR: as there are
fewer users than items in the dataset, there are fewer rows
in the user-feature matrix than in the item-feature matrix.</p>
      <p>The results confirm that LightFM encompasses both the
MF and the LSI-LR models as special cases, performing
better than the LSI-LR model in the sparse-data scenario and
better than the MF model in the dense-data case. This
means not only that a single model can be maintained in
either setting, but also that the model will continue to
perform well even when the sparsity structure of the data
changes.</p>
      <p>
        The good performance of LightFM (tags) on both datasets
is predicated on the availability of high-quality metadata.
Nevertheless, it is often possible to obtain good quality
metadata from item descriptions (genres, actor lists and so on),
expert or community tagging (Pandora [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], StackOverflow),
or computer vision systems where image or audio data is
available (we use image-based convolutional neural networks
for product tagging). In fact, the feature embeddings
produced by LightFM can themselves be used to assist the
tagging process by suggesting related tags.
      </p>
    </sec>
    <sec id="sec-13">
      <title>6.2 Parameter Sensitivity</title>
      <p>LightFM outperforms the 64-dimensional LSI-LR model
even when using fewer than 32 dimensions.</p>
      <p>This is an important win for large-scale recommender
systems, where the choice of d is governed by a trade-off
between vector size and recommendation accuracy. Since smaller
vectors occupy less memory and use fewer computations
during query time, better representational power at small d
allows the system to achieve the same model performance at
a smaller computational cost.
</p>
    </sec>
    <sec id="sec-14">
      <title>6.3 Tag embeddings</title>
      <p>Feature embeddings generated by the LightFM model
capture important information about the semantic relationships
between different features. Table 2 gives some examples by
listing groups of tags similar (in the cosine similarity sense)
to a given query tag.</p>
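<p>Similar-tag queries of this kind reduce to a cosine nearest-neighbour search over the tag embeddings (the toy embeddings in the test are fabricated for the example):</p>

```python
import numpy as np

def most_similar_tags(query, tag_embeddings, n=3):
    """Rank all other tags by cosine similarity to the query tag."""
    q = tag_embeddings[query]
    q = q / np.linalg.norm(q)
    sims = {tag: float(vec @ q / np.linalg.norm(vec))
            for tag, vec in tag_embeddings.items() if tag != query}
    return sorted(sims, key=sims.get, reverse=True)[:n]
```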
      <p>
        In this respect, LightFM is similar to recent word
embedding approaches like word2vec and GloVe [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. This
is perhaps unsurprising, given that word embedding
techniques are closely related to forms of matrix factorisation
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Nevertheless, LightFM and word embeddings differ in
one important respect: whilst word2vec and GloVe
embeddings are driven by textual corpus co-occurrence statistics,
LightFM is based on user interaction data.
      </p>
      <p>LightFM embeddings are useful for a number of
recommendation tasks.</p>
      <p>
        1. Tag recommendation. Various applications use
collaborative tagging as a way of generating richer
metadata for use in search and recommender systems [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ].
A tag recommender can enhance this process by either
automatically applying matching tags, or generating
suggested tag lists for approval by users. LightFM-produced
tag embeddings will work well for this task
without the need to build a separate specialised model
for tag recommendations.
2. Genre or category recommendation. Many
domains are characterised by an ontology of genres or
categories which play an important role in the
presentation of recommendations. For example, the Netflix
interface is organised in genre rows; for Lyst, fashion
designers, categories and subcategories are
fundamental. The degree of similarity between the embeddings
of genres or categories provides a ready basis for genre
or category recommendations that respect the
semantic structure of the ontology.
3. Recommendation justification. Rich information
encoded in feature embeddings can help provide
explanations for recommendations made by the system. For
example, we might recommend a ball gown to a user
who likes pencil skirts, and justify it by the two
features' similarity as revealed by the distance between
their latent factors.
      </p>
    </sec>
    <sec id="sec-15">
      <title>7. USAGE IN PRODUCTION SYSTEMS</title>
      <p>The LightFM approach is motivated by our experience
at Lyst. We have deployed LightFM in production, and
successfully use it for a number of recommendation tasks. In
this section, I describe some of the engineering and algorithm
choices that make this possible.
</p>
    </sec>
    <sec id="sec-16">
      <title>7.1 Model training and fold-in</title>
      <p>Thousands of new items and users appear on Lyst every
day. To cope with this, we train our LightFM model in
an online manner, continually updating the representations
of existing features and creating fresh representations for
features that we have never observed before.</p>
      <p>We store model state, including feature embeddings and
accumulated squared gradient information, in a database.
When new data on user interaction arrives, we restore the
model state and resume training, folding in any newly
observed features. Since our implementation uses per-parameter
diminishing learning rates (Adagrad), any updates of
established features will be incremental as the model adapts to
new data. For new features, a high learning rate is used to
allow useful embeddings to be learned as quickly as possible.</p>
      <p>No re-training is necessary for folding in new products:
their representation can be immediately computed as the
sum of the representations of their features.</p>
      <p>Each of our products is described by a set of textual
features as well as structured metadata such as its type (dress,
shoes and so on) or designer. These are accompanied by
additional features coming from two sources.</p>
      <p>Firstly, we employ a team of experienced fashion
moderators, helping us to derive more fine-grained features such as
clothing categories and subcategories (peplum dress,
halterneck and so on).</p>
<p>Secondly, we use machine learning systems for automatic
feature detection. The most important of these is a set
of deep convolutional neural networks deriving feature tags
from product image data.</p>
    </sec>
    <sec id="sec-17">
<title>7.3 Approximate nearest neighbour searches</title>
<p>The biggest application of LightFM-derived item
representations is related-product recommendations: given a
product, we would like to recommend other highly relevant
products. To do this efficiently across 8 million products, we
use a combination of approximate (for on-demand
recommendations) and exact (for near-line computation) nearest
neighbour search.</p>
      <p>
        For approximate nearest neighbour (ANN) queries, we use
Random Projection (RP) trees [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. RP trees are a
variant of random-projection [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] based locality sensitive hashing
(LSH).
      </p>
<p>In LSH, a k-bit hash code for each point x is generated
by drawing k random hyperplanes v<sub>i</sub> and setting the i-th
bit of the hash code to 1 if x · v<sub>i</sub> ≥ 0 and to 0 otherwise. The
approximate nearest neighbours of x are then the other points
that share its hash code (or whose hash codes are
within some small Hamming distance of it).</p>
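<p>A small sketch of this hashing scheme, assuming Gaussian random hyperplanes and illustrative dimensions:</p>

```python
import numpy as np

def lsh_codes(points, k=16, seed=0):
    # Bit i is 1 iff the point lies on the positive side of random
    # hyperplane v_i, i.e. x . v_i >= 0.
    rng = np.random.default_rng(seed)
    hyperplanes = rng.normal(size=(points.shape[1], k))
    bits = (points @ hyperplanes >= 0).astype(np.uint64)
    # Pack the k bits into a single integer code for bucketing.
    return bits @ (1 << np.arange(k, dtype=np.uint64))

points = np.random.default_rng(1).normal(size=(1000, 32))
codes = lsh_codes(points)
# Candidate neighbours of points[0] are the points sharing its code.
candidates = np.flatnonzero(codes == codes[0])
```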
<p>While extremely fast, LSH has the undesirable property
of sometimes producing a highly unbalanced distribution
of points across hash codes: if points are densely
concentrated, many codes will apply to no products
while some will describe a very large number of points. This
is unacceptable when building a production system, as it
leads to many queries being very slow.</p>
<p>RP trees provide much better guarantees about the size
of leaf nodes: at each internal node, points are split based
on the median distance to the chosen random hyperplane.
This guarantees that at every split approximately half the
points are allocated to each child, making the distribution
of points (and query performance) much more predictable.</p>
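<p>A single split of this kind can be sketched as follows (a simplification: real RP-tree implementations recurse on the children and may perturb the split point, as in [4, 5]):</p>

```python
import numpy as np

def median_split(points, rng):
    # One RP-tree node: project onto a random direction and split at
    # the median projection, so each child gets about half the points.
    direction = rng.normal(size=points.shape[1])
    projections = points @ direction
    threshold = np.median(projections)
    left_mask = projections <= threshold
    return points[left_mask], points[~left_mask]

rng = np.random.default_rng(0)
pts = rng.normal(size=(1001, 8))
left, right = median_split(pts, rng)
# len(left) and len(right) differ by at most one for continuous data.
```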
    </sec>
    <sec id="sec-18">
      <title>CONCLUSIONS AND FUTURE WORK</title>
<p>In this paper, I have presented an effective hybrid
recommender model dubbed LightFM. I have shown the following:
1. LightFM performs at least as well as a specialised
model across a wide range of collaborative data
sparsity scenarios. It outperforms existing content-based
and hybrid models in cold-start scenarios where
collaborative data is sparse or where user metadata is
available.
2. It produces high-quality content feature embeddings
that capture important semantic information about
the problem domain and can be used for related tasks
such as tag recommendations.</p>
<p>Both properties make LightFM an attractive model,
applicable in both cold- and warm-start settings. Nevertheless,
I see two promising directions for extending the current
approach.</p>
      <p>
        Firstly, the model can be easily extended to use more
sophisticated training methodologies. For example, an
optimisation scheme using Weighted Approximate-Rank Pairwise
loss [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] or directly optimising mean reciprocal rank could
be used [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
<p>Secondly, there is no easy way of incorporating visual or
audio features in the present formulation of LightFM. At
Lyst, we use a two-step process to address this: we first
use convolutional neural networks (CNNs) on image data
to generate binary tags for all products, and then use the
tags for generating recommendations. We conjecture that
substantial improvements could be realised if the CNNs were
trained with a recommendation loss directly.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          .
<article-title>Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>17</volume>
          (
          <issue>6</issue>
          ):
          <fpage>734</fpage>
          –
          <lpage>749</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bastian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Skomoroch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Uryasev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          .
          <article-title>Linkedin skills: large-scale topic extraction and inference</article-title>
          .
          <source>In Proceedings of the 8th ACM Conference on Recommender systems</source>
          , pages
<fpage>1</fpage>
          –
          <lpage>8</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          .
          <article-title>Experiments with random projection</article-title>
          .
<source>In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence</source>
          , pages
          <fpage>143</fpage>
          –
          <lpage>151</lpage>
          . Morgan Kaufmann Publishers Inc.,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Freund</surname>
          </string-name>
          .
          <article-title>Random projection trees and low dimensional manifolds</article-title>
          .
          <source>In Proceedings of the fortieth annual ACM symposium on Theory of computing</source>
          , pages
<fpage>537</fpage>
          –
          <lpage>546</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Sinha</surname>
          </string-name>
          .
          <article-title>Randomized partition trees for exact nearest neighbor search</article-title>
          .
<source>arXiv preprint arXiv:1302.1948</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duchi</surname>
          </string-name>
,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hazan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          .
          <article-title>Adaptive subgradient methods for online learning and stochastic optimization</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2121</fpage>
–
          <lpage>2159</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
<string-name>
            <given-names>R.</given-names>
            <surname>Jäschke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Stumme</surname>
          </string-name>
          .
          <article-title>Tag recommendations in folksonomies</article-title>
          .
          <source>In Knowledge Discovery in Databases: PKDD</source>
          <year>2007</year>
          , pages
          <fpage>506</fpage>
–
          <lpage>514</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Matrix factorization techniques for recommender systems</article-title>
.
          <source>Computer</source>
          , (
          <issue>8</issue>
          ):
          <fpage>30</fpage>
          –
          <lpage>37</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <article-title>Neural word embedding as implicit matrix factorization</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2177</fpage>
–
          <lpage>2185</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          ,
<string-name>
            <given-names>M.</given-names>
            <surname>De Gemmis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Semeraro</surname>
          </string-name>
          .
          <article-title>Content-based recommender systems: State of the art and trends</article-title>
          .
          <source>In Recommender systems handbook</source>
          , pages
<fpage>73</fpage>
          –
          <lpage>105</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Hidden factors and hidden topics: understanding rating dimensions with review text</article-title>
          .
          <source>In Proceedings of the 7th ACM conference on Recommender systems</source>
          , pages
<fpage>165</fpage>
          –
          <lpage>172</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
,
          <string-name>
            <given-names>G.</given-names>
            <surname>Corrado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
<article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
.
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
<source>Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014)</source>
          ,
          <volume>12</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Recht</surname>
          </string-name>
,
          <string-name>
            <given-names>C.</given-names>
            <surname>Re</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wright</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Niu</surname>
          </string-name>
          .
<article-title>Hogwild!: A lock-free approach to parallelizing stochastic gradient descent</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>693</fpage>
–
          <lpage>701</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          .
          <article-title>Factorization machines</article-title>
          .
<source>In Data Mining (ICDM), 2010 IEEE 10th International Conference on</source>
          , pages
          <fpage>995</fpage>
          –
          <lpage>1000</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
, and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          .
          <article-title>BPR: Bayesian personalized ranking from implicit feedback</article-title>
          .
<source>In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence</source>
          , pages
          <fpage>452</fpage>
–
          <lpage>461</lpage>
          . AUAI Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. San</given-names>
            <surname>Pedro</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          .
          <article-title>Question recommendation for collaborative question answering systems with RankSLDA</article-title>
          .
          <source>In Proceedings of the 8th ACM Conference on Recommender systems</source>
          , pages
<fpage>193</fpage>
          –
          <lpage>200</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Saveski</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mantrach</surname>
          </string-name>
          .
          <article-title>Item cold-start recommendations: learning local collective embeddings</article-title>
          .
          <source>In Proceedings of the 8th ACM Conference on Recommender systems</source>
          , pages
<fpage>89</fpage>
          –
          <lpage>96</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Oliver</surname>
          </string-name>
, and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>CLiMF: learning to maximize reciprocal rank with collaborative less-is-more filtering</article-title>
          .
          <source>In Proceedings of the 6th ACM Conference on Recommender systems</source>
          , pages
          <fpage>139</fpage>
          –
          <lpage>146</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shmueli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kagian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lempel</surname>
          </string-name>
.
          <article-title>Care to Comment?: Recommendations for commenting on news stories</article-title>
          .
          <source>In Proceedings of the 21st international conference on World Wide Web</source>
          , pages
<fpage>429</fpage>
          –
          <lpage>438</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>I.</given-names>
<surname>Soboroff</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicholas</surname>
          </string-name>
          .
<article-title>Combining content and collaboration in text filtering</article-title>
          .
          <source>In Proceedings of the IJCAI</source>
          , volume
          <volume>99</volume>
          , pages
          <fpage>86</fpage>
–
          <lpage>91</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Vig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sen</surname>
          </string-name>
          , and
<string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>The tag genome: Encoding community knowledge to support novel interaction</article-title>
          .
          <source>ACM Transactions on Interactive Intelligent Systems (TiiS)</source>
          ,
          <volume>2</volume>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Westergren</surname>
          </string-name>
          .
          <article-title>The music genome project</article-title>
. Online: http://pandora.com/mgp,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
.
          <article-title>WSABIE: Scaling up to large vocabulary image annotation</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <volume>11</volume>
          , pages
          <fpage>2764</fpage>
–
          <lpage>2770</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>