=Paper= {{Paper |id=Vol-2319/paper18 |storemode=property |title=A Multimodal Recommender System for Large-scale Assortment Generation in e-Commerce |pdfUrl=https://ceur-ws.org/Vol-2319/paper18.pdf |volume=Vol-2319 |authors=Murium Iqbal,Adair Kovac,Kamelia Aryafar |dblpUrl=https://dblp.org/rec/conf/sigir/IqbalKA18 }} ==A Multimodal Recommender System for Large-scale Assortment Generation in e-Commerce== https://ceur-ws.org/Vol-2319/paper18.pdf

A Multimodal Recommender System for Large-scale Assortment Generation in E-commerce
Murium Iqbal, Overstock, Midvale, Utah, miqbal@Overstock.com
Adair Kovac, Overstock, Midvale, Utah, akovac@Overstock.com
Kamelia Aryafar, Overstock, Midvale, Utah, karyafar@Overstock.com
ABSTRACT
E-commerce platforms surface interesting products largely through product recommendations that capture users’ styles and aesthetic preferences. Curating recommendations as a complete complementary set, or assortment, is critical for a successful e-commerce experience, especially for product categories such as furniture, where items are selected together with the overall theme, style or ambiance of a space in mind. In this paper, we propose two visually-aware recommender systems that can automatically curate an assortment of living room furniture around a couple of pre-selected seed pieces for the room. The first system aims to maximize the visual-based style compatibility of the entire selection by making use of transfer learning and topic modeling. The second system extends the first by incorporating text data and applying polylingual topic modeling to infer style over both modalities. We review the production pipeline for surfacing these visually-aware recommender systems and compare them through offline validations and large-scale online A/B tests on Overstock¹. Our experimental results show that complementary style is best discovered over product sets when both visual and textual data are incorporated.

CCS CONCEPTS
• Information systems → Recommender systems; Collaborative filtering; Presentation of retrieval results; • Computing methodologies → Neural networks;

KEYWORDS
recommender system, visual document, topic modeling, set recommendation, quadratic knapsack problem, product recommendation

ACM Reference format:
Murium Iqbal, Adair Kovac, and Kamelia Aryafar. 2018. A Multimodal Recommender System for Large-scale Assortment Generation in E-commerce. In Proceedings of ACM SIGIR Workshop on eCommerce, Ann Arbor, Michigan, USA, July 2018 (SIGIR 2018 eCom), 9 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGIR 2018 eCom, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).

¹ www.overstock.com
² Essential item constraints refers to specific products which must be included in an assortment, e.g. a bed frame in a bedroom set or a vendor must-have in a subscription box.
³ www.stitchfix.com
⁴ www.rocksbox.com
⁵ www.birchbox.com

Figure 1: An automatically generated assortment from the multimodal approach is shown.

1 INTRODUCTION
Overstock¹ is an e-commerce platform with the goal of creating dream homes for all. Users browse Overstock’s catalog to select pieces that complement one another while matching the stylistic settings and color palettes of their rooms. As furniture is not a disposable product, furniture purchases are subject to careful scrutiny of aesthetics and strict budgeting. Brick and mortar stores often inspire consumers by creating furniture showrooms, in some cases pushing consumers to walk through their carefully selected furniture displays. These assorted showrooms alleviate the creative and stylistic pressure on the consumer, since the set is already on display with all the necessary pieces.

By crafting an appropriate recommender system, this experience can be recreated on an e-commerce platform. Figure 1 illustrates an example of the proposed system’s automatically generated showroom, built around two seed products selected from Overstock¹. The goal of this recommender system is to provide set recommendations that adhere to a general theme or cohesive visual style while accounting for essential item constraints². Set recommendations have become more prominent with the rise of subscription box services in various domains such as fashion (e.g. StitchFix³), jewelry (e.g. Rocksbox⁴) and beauty products (e.g. Birchbox⁵). The criteria for selecting an assortment depend heavily on the product space, which can inform the latent space of user preferences and product
SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA Murium Iqbal, Adair Kovac, and Kamelia Aryafar

representations. Personal care product recommendations, for example, rely on user preferences, such as skin type, which can be inferred from product text-based attributes. Fashion, jewelry and furniture shopping, on the other hand, are predominantly visual experiences. This motivates a visually-aware representation for products and style-based preferences for users.

We propose two assortment recommender systems. The first takes advantage of visually semantic features transferred from a deep convolutional neural network to learn style. The second uses both these visual features and product text-based attributes to learn style across the multimodal dataset. Our hypothesis is that while the visual style can help find similar products, use of text data in conjunction with images will result in more complementary and cohesive stylized assortments.

Figure 2: A system overview diagram is presented above. We use product data and user engagement data residing within our Hadoop cluster and our NFS/S3 store to train the model. Most of the training is done on a one box GPU server, which runs the Resnet-50, Mallet and our greedy algorithm. All results are served to Overstock¹ users from Cassandra.

2 RELATED WORK
As the prevalence of recommendation systems grows, e-commerce platforms are looking to increase the influence on customer purchases by not just recommending one item at a time but several in a bundle together [25] [27] [1] [6]. This is true across industries, from travel agencies to clothing [10] [14] [15]. Specifically, Zhu et al. [27] model the problem of bundle recommendations as the Quadratic Knapsack Problem (QKP), and find an approximation to it. They use only implicit feedback data, rather than content data, to build their system, allowing for the bundles to be personalized. Following this approach, we also view our system as a QKP, but look to leverage primarily content data, incorporating implicit feedback as a minor signal. To capture a user’s preference, the system we propose allows a user to present a seed product, around which we can create the bundle. This allows us to circumvent the cold start problem, as any user, regardless of having a history on our site, can still build assortment recommendations. New products are also seamlessly incorporated, as long as image or text data is made available.

Recommendation systems are a well-studied field with two large overarching types: collaborative filtering methods [18] [12] [7], and content based methods [4] [21] [22], many of which rely on learning a latent representation of the input data. Topic modeling, specifically Latent Dirichlet Allocation (LDA) [3], has also been applied to create topic based recommendations both from text input and from implicit feedback [11] [2] [26]. These systems work by scoring individual products against one another or against users to see which are most similar. Here we look to extend this methodology by applying topic modeling over both image and text data. Rather than simply concatenating image and text features and linearly combining them, PolyLDA enables us to learn two distinct but coupled latent style representations. This allows for a versatile interpretation of style [13]. We then use these learned styles to make bundle or assortment recommendations rather than the traditional single item recommendations.

To facilitate style discovery from the images, we process them via prevalent deep learning techniques. Deep residual networks have recently been shown to be a powerful model for capturing style based preferences when creating visual wardrobes from fashion images [10]. With the goal of learning visually coherent styles, we apply transfer learning by using a Resnet-50⁶ which was pre-trained on ImageNet⁷ [8]. We explore interpreting the convolutional neural network by indexing the activations on channels within the convolutional layers [19]. We use a neural network to learn filters for our data and simply index their responses to images to create visual documents, similar to the older bag of words methods [24]. To discover visually-aware trends, we use these documents with Latent Dirichlet Allocation (LDA) [3].

⁶ https://github.com/facebook/fb.resnet.torch
⁷ http://www.image-net.org/

While we use LDA for topic modeling on single-modality image data, we need an extension to interpret both visual semantic features and text-based attribute data as complementary modalities. Roller et al. [20] create a multimodal topic model that assumes words and corresponding visual features occur in pairs and should be captured as tuples within the topic distribution. This model would work well for descriptions of images, but the underlying assumption could prove to be too stringent to generalize to our application. Mimno et al. [17] offer a more flexible extension of LDA, Polylingual LDA (PolyLDA), which handles tuples of documents in different languages with assumed identical topic distributions. The documents within a tuple do not need to be direct translations of one another, and the topics themselves have distinct sets of words for each language. By treating our image data and our text data as two separate languages, we can use PolyLDA directly to create a more flexible multimodal topic model with which to infer style.

Hsiao and Grauman [10] also apply PolyLDA to infer style from images, but they use complementary clothing types (e.g. pants and blouse) as languages in a compatibility model instead of using PolyLDA to handle different modalities of data for the same item. The documents they use for their topic models are generated using a Resnet trained to automatically detect their pre-defined text attributes within images and apply the appropriate labels.

Much of the work performed on traditional recommendation systems can be applied here, but additional constraints must be taken into consideration. Products can’t simply be visually similar to be purchased together, but must be complementary as well. For example, having two similar brightly colored upholstered chairs as a result in a traditional recommendation is perfectly acceptable, but if they are both taken into one assortment with the intention of being bought together, they may clash. Here we show that by incorporating text data, we are able to avoid assortments of incredibly visually similar, interchangeable products which are not at all complementary. These nuances can be further addressed by incorporating purchase data to infer what types of products are not just similar, but also compatible, and likely to be bought together. To create the final assortment recommendations, we define a distance measure that combines stylistic similarity with implicit feedback from users, in the form of purchases, to develop a concept of affinity similar to approaches in composite recommendation systems such as [9, 10, 23, 27].
The rest of this paper is organized as follows: In Section 3, we present our production pipeline for assortment recommendations as deployed in Overstock¹ and explore our product embeddings as used with topic modeling techniques. We also review our formulation of assortment generation as a 0-1 quadratic knapsack problem and our greedy optimization for building assortments. Section 4 presents our offline and online experimental results. Finally, we conclude our paper in Section 5 and propose future directions.

3 METHODOLOGY
Here we provide an overview of the production system used to surface recommendations throughout Overstock¹ and describe our assortment recommender system in detail. We first explain our product embeddings as text-based bag-of-words and bag-of-visual-words (BoVW) documents. Then we explain the LDA approaches used to discover text-based and visual styles across our platform. We finally discuss our formulation of assortment generation as a 0-1 quadratic knapsack problem with budget constraints and our greedy assortment recommendations algorithm.

3.1 System Overview
Product recommendations on Overstock¹ are surfaced through multiple carousels across the platform. User interactions with products and recommendations are recorded alongside purchase data. Each product on the platform is represented by its corresponding title, text attribute tags, text-based descriptions and images. The first step in generating recommendations is processing the large volume of product and user data, which is hosted on Hadoop servers.

After initial processing on Hadoop, the data is transferred to a one-box CUDA-enabled GPU server. Here product images are passed through a publicly available Resnet-50 to create image-based documents. Both the image and text documents are then fed into Mallet⁸ to generate topic representations of products. We create two topic variants: a visual variant that uses the image data with LDA, and a multimodal variant that uses both text and image data with PolyLDA.

Once the topic distributions are created, the products reside within their defined feature space. Hereafter, the methodology for both variants is identical. We construct a distance measure over the topic space, using implicit feedback from our users, to model compatibility. The assortments are then generated by finding products which minimize this distance to predefined seed product pairs.

The generated assortments are pushed to our Cassandra database cluster. Our website servers can query Cassandra directly to see which products have an associated assortment to present to users. Assortment recommendations are featured on corresponding product pages, below the product description, providing an option for users to purchase an entire assortment, or just parts of it. A brief diagram of the systems involved is presented in Figure 2.

⁸ http://mallet.cs.umass.edu/

Figure 3: Here we depict activations from the layers used to create our visual documents. The leftmost column is the raw image. The next column to the right is the cropped patch which is passed through the network. In the next 4 columns, we see the response to these patches from 9 channels each of convolutional layers 8, 18, 31 and 43. Lighter colors mean strong responses, while darker colors mean weak responses. We threshold these responses to determine whether or not the "word" each channel represents is present within the image. This methodology seems to work particularly well for our upholstered furniture, showing distinct responses to the different patterns.

3.2 Product Embeddings
Products on Overstock¹ have several images of the item for sale and different forms of associated text data including title, description and attribute tags. Product attributes are descriptive tags associated with a product. Some example attribute categories are color, size, style, composing materials, finish, and brand. In this paper, we utilize these attributes as the primary text-based information since they often provide a rich text representation of the item that complements the provided images. All product attribute data is processed to remove stop words and is concatenated with each product title to form a bag-of-words document.

We use the methodology described in [13] to create visual documents to be used with LDA via a Resnet. We then combine both visual and text data together via PolyLDA.

The convolutional layers within a Resnet are composed of a series of small learned filters. Each filter is convolved with the input in steps along its height and width. The result of this process is a 2-dimensional grid of activations that represents the response from the filter at corresponding locations of the input. The learned filters within a Resnet respond to specific patterns. By viewing these filters as words, and indexing them, Iqbal et al. [13] show that a BoVW document can be created which, when combined with LDA, is effective at uncovering style. To index a layer, we simply threshold the response from the filter to indicate whether or not it is sufficiently activated by the image to be included within the image’s document.
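The indexing step just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the activation grids have already been extracted from the network, and the function name `visual_document`, the word naming scheme, the choice of summarising each channel by its maximum response, and the threshold value are all assumptions made for the example.

```python
def visual_document(layer_activations, threshold=0.5):
    """Build a bag-of-visual-words document for one image.

    layer_activations maps a layer index to a list of channels, where each
    channel is a 2-D grid (list of rows) of filter responses.
    A channel contributes a "word" when its strongest response exceeds
    the threshold (max-pooling over positions is an assumption here).
    """
    words = set()
    for layer, channels in layer_activations.items():
        for channel, grid in enumerate(channels):
            peak = max(value for row in grid for value in row)
            if peak > threshold:
                # Hypothetical naming scheme: one word per (layer, channel).
                words.add(f"layer{layer}_ch{channel}")
    return words


# Toy example: two channels from layer 8, one channel from layer 18.
example = {
    8: [[[0.1, 0.9]], [[0.2, 0.3]]],
    18: [[[0.7]]],
}
document = visual_document(example, threshold=0.5)
```

With this scheme the vocabulary size equals the total number of channels across the indexed layers, since each channel can contribute at most one word.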

Given this methodology for creating image documents, our visual word vocabulary is defined entirely by the channels in the layers that we choose to index. We tested creating documents using various layers through the network and found that combining several middle layers gave us the best results. After experimenting, we empirically chose to use Resnet-50 layers 8, 18, 31, and 43. The combination of these layers provided significantly better results, per visual inspection, than any single layer or other combinations. This yielded a total vocabulary size of 2816 visual words over our image documents. Since multiple images can be provided for each product, we take the union of all present visual words within the corresponding images as the visual document for the product itself. The process is depicted in Figure 4.

Figure 4: Each product’s corresponding text and images are collected. Our system then discards images with noisy scenes, keeping only those with white backgrounds and passing them through Resnet-50. Channel activations from layers 8, 18, 31, and 43 are thresholded and indexed to create visual documents. The union of these documents is taken as the visual representation of the product. The text data is stripped of stopwords and paired with the corresponding visual document to create a tuple for ingestion by PolyLDA.

3.3 Topic Modeling
Once the visual documents are created, they can be used directly with LDA to create topics for our visual variant. LDA is an unsupervised generative model which infers probability distributions of words within topics and models documents as a mixture of those topics [3]. Given a set of documents and a number of target topics, the model assumes the following generative process was used to create the documents.
(1) For M documents, initialize the set of topic distributions, $\theta_i \sim \mathrm{Dir}(\alpha)$ for $i = 1 \ldots M$
(2) For K topics, initialize the set of word distributions, $\phi_k \sim \mathrm{Dir}(\beta)$ for $k = 1 \ldots K$
(3) For the k-th word in the i-th document, select a topic $z_i$ from $\theta_i$ and a word $w_{i,k}$ from $\phi_{z_i}$
where Dir is the Dirichlet distribution.

The model initializes the parameters randomly then iteratively updates these parameters via Gibbs Sampling and variational Bayes inference. After many iterations the topic distributions converge to a stable state and the resulting topics can be used as a low dimensional feature space that captures the salient content of the original documents.

For the multimodal variant, each visual document is paired with its text equivalent in a tuple. These tuples are then used with Polylingual LDA (PolyLDA) [17]. Products that are missing either a text document or a visual document can still be used with PolyLDA, as it is robust to missing documents within tuples. A diagram of the system for multimodal document creation is provided in Figure 4. PolyLDA is an extension of LDA meant to handle loosely equivalent documents in different languages. It takes as input tuples of documents, each in a different language but with the same exact topic distribution. Each resulting latent topic from this method contains a distinct associated set of word distributions for each language. Documents aren’t required to be direct translations of one another, which allows for flexibility. A word in one language could be attributed to a given topic while its direct translation in another language may be attributed to a different topic. Words that only appear in one language can also be present in topics despite having no equivalent translations in the corpus. PolyLDA extends LDA by assuming the following generative procedure:
(1) For a given tuple of documents, $\{d_1 \ldots d_L\}$, initialize a single set of topic distributions, $\theta \sim \mathrm{Dir}(\alpha)$
(2) For K topic sets with L languages, initialize the set of word distributions, $\phi_{k,l} \sim \mathrm{Dir}(\beta)$ for $k = 1 \ldots K$ and $l = 1 \ldots L$
(3) For the k-th word in the l-th document in the i-th tuple, select a topic $z_{i,l}$ from $\theta_i$ and a word $w_{i,l,k}$ from $\phi_{z_{i,l}}$

For our multimodal variant, we make use of the flexibility afforded by PolyLDA by applying it to our visual and text documents, representing each modality as a distinct language attempting to describe the same product. This method should allow for topics that afford better generality. Intuitively for our corpus, we suspect that a topic could capture relationships between certain visual features, like thick wood, with textual attributes, like "quality", that don’t have a direct visual representation but do share an underlying contextual relationship. By combining modalities the model also affords us the ability to infer a more complete style representation given incomplete data. For example, we can apply our multimodal topics to products with only text attributes available and still infer style that would normally be captured by the missing image data.

We provide visualizations of the resulting topics from both variants in Figure 5. To create these visualizations we selected the products that are maximally aligned with each topic. In our case, we have empirically seen that the topics relate well to various styles while reducing our feature space from a large number of attribute tags, title words, and image-based channel activations to a succinct k-dimensional space (where k is the number of topics). These styles are not necessarily complementary though, especially in the case of the visual topics. We can see that the items within the topics seem more substitutable, e.g. in the cases of brightly colored furniture, similarly patterned upholstery, and mirrored surfaces.

3.4 Knapsack Problem
Once we have all our products residing within the topics’ feature space, we can begin to build out our assortments. The process for generating the assortments is the same for both the visual variant and the multimodal variant, with the only difference between the two

being the topic representations of products. To build these assortments, we first define the necessary components of an assortment as verticals. The verticals we define here enable us to build living room assortments, but we would like to emphasize that verticals can be defined for any e-commerce platform. If desired, the verticals can be left undefined, and the assortment can be built with no vertical constraints. For our specific application verticals are critical pieces of living room furniture, e.g. coffee table, chair, or accent table. Our definitions are manually applied to the products and are listed below. (Here a couch set can be either a sectional sofa or a sofa and loveseat combination.)

Verticals: {Couch Set, Coffee Table, Accent Table, Entertainment Center, Bookshelf, Ottoman, Chair}

Figure 5: Representative items from each of six topics in the visual variant on the left (a) and multimodal variant on the right (b) are depicted above.

We chose to build our assortments around seed pieces. For our current application, we generated seeds, but we can easily allow users to provide their own seeds around which we can curate an assortment. We selected seeds as preferred pairings of the most crucial verticals in any living room space, namely a couch set and a coffee table. Although all the other verticals can be seen as optional, these two verticals are defining members of a living room and therefore must be present in any assortment. We chose pairs of sofas and coffee tables which are most frequently co-clicked for our seeds.

We then assume a budget constraint for the entire assortment. As such we can formulate our assortment generation as a 0-1 quadratic knapsack problem (0-1 QKP) as defined by Gallo et al. [5]. The 0-1 QKP assumes that a knapsack with limited space needs to be filled with a set of items. These items all have an intrinsic value. Items also have an additional associated pair-wise profit when selected together for the knapsack.

For a given seed, we can find the optimal assortments by trying to find the products that are nearest the seed in the topic space. We can add the additional constraint that we want all products within the assortment to be nearest one another in the topic space as well. The total assortment must remain within the assumed budget constraint. Each product has an associated vertical, and we can add in some constraints on the minimum and maximum number of products each vertical can contribute to the final assortment.

To calculate proximity we use a Mahalanobis distance built on purchase data similar to [16]. The Mahalanobis distance allows us to leverage the implicit feedback of our users to understand how our topics distribute and relate to one another across multiple products within a purchase.

We assume that purchases of living room furniture made by the same user over a 3 month period are being used in the same room, and as such can be said to represent a compatible assortment. After trimming to include only purchases with 3-10 pieces of living room furniture, so as to avoid noise and erroneous signals from bulk purchasers, we create a dataset of roughly 7.5k purchased assortments. We take the L2 normalized sum of the topic distributions of the co-purchased products as a topic representation of the purchase itself, and use this dataset to calculate the covariance matrix used for our Mahalanobis distance.

The Mahalanobis distance is defined as $d_M(x_i, x_j) = (x_i - x_j) M (x_i - x_j)^T$, where M is the covariance matrix for the topics learned from our purchase data. For each seed pair of coffee table and couch set, we greedily swap products in and out of our assortment until we

Algorithm 1 Generate Assortments the pool of products that are labeled as candidates for this vertical.
δ ←ϵ +1 Those which are closest to the total assortment (minus the current
2: Ai,pr ev ← {}∀i∈ {V−S} vertical) at the previous time-step are added in, until the size of the
Ai,pr ev ←{pi }∀i∈ {S} vertical is met. The sizes of the verticals in our experiment were
4: while δ ≥ ϵ do set by us, but a user can dictate how many side tables, bookshelves,
δ ←0 chairs, etc they wish to add to their assortment.
6: for i ∈ {V−S} do Let V be the set of all verticals. S is the set of verticals used in the
t A ← k( tp )k ∀p j ∈ A j,pr ev , ∀j ∈ V , i
Í seed. Ai is the set of products selected for the assortment in vertical i
8: while size(Ai ) < i size do for the current timestep. Each vertical i has an associated size, i size ,
Ai ← Ai ∪ argmin(dM (to , t A )) which is the number of products we want for this vertical. Ai,pr ev
10: δ ← δ + d M (t Ai , t Ai,pr ev ) is the set of products selected for the assortment in vertical i for the
previous timestep. pi is an element of Ai,pr ev . oi is a candidate
product for vertical i. to is the topic distribution of a product. d M is
the Mahalanobis distance as defined above. δ is the change in the
converge to an ideal set of products. The only verticals which are distance of the assortment from iteration to iteration. The greedy
not allowed to change are those in the seed. approach is described using this notation in Algorithm 1. Figure 6
To formulate the O-1 QKP we first define some terminology. Each illustrates the generated assortments using this method for both the
product o will have a vertical label ao , an associated price co , and visual and multimodal variants.
associated topic distribution to . All products will have a pairwise
distance associated with them and every other product di, j which 4 RESULTS AND DISCUSSION
is the Mahalanobis distance between products oi and o j built on
the purchases and topic distributions. Additionally the products all This section describes a large-scale experiment on Overstock 1 to
have a score qo to represent their value to the seed, here we use the determine whether simultaneously learning style from text and im-
inverse of the distance from the product to the seed. Let M be the set ages provides better results than learning style from only images.
of products comprising the assortment. Let B be the total available All offline validations and online A/B test assume 2 variants: a
budget minus the cost of the seed. The 0-1 knapsack problem can visual-only assortment recommender system, which is built on top
then be formulated as:

\begin{equation}
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{M} \sum_{j=1}^{M} q_i + \frac{1}{d_{i,j}}, \quad j \neq i \\
\text{subject to:} \quad & \forall i : \sum_{i=1}^{M} c_i \le B \\
& \forall k : \mathrm{min}_k \le \operatorname{count}(a_o = k) \le \mathrm{max}_k
\end{aligned}
\tag{1}
\end{equation}

A greedy approximation can now be formulated to follow the constraints of the 0-1 QKP.
First a solution is initialized, then items are iteratively swapped until the system
converges to a (locally) optimal solution. To get the initial set, all products are
scored by their total potential for compatibility (the product's own score plus the sum
of the reciprocals of its pairwise Mahalanobis distances) divided by their price.
Products are then sorted by this ratio and the highest scored are added until the
vertical constraints or the budget constraint are met. Then each product is considered
for a swap with other products to improve the total compatibility of the assortment.

3.5 Greedy Method
For our online evaluation, we relax the budget constraints, so that we can display the
assortment on product pages. This allows us to select the best assortment per product.
We can easily reapply the budget constraint for user assortments. As such, we use the
following greedy method to generate the assortments. The assortment is initialized to
contain only the seed items; all other verticals are empty. Each vertical is then
iteratively considered for new products while holding all other verticals constant.
New products are chosen from

of BoVW representation of product images, and a multimodal one, which is built on
text-based attributes and visual words. Our site does offer manually curated
collections, but there is little overlap between the products selected in the manual
process and those selected by our automated system. The methodology used to create
these collections also involves selecting products from the same vendor and the same
product line, which can often be viewed as substitutable products as well. These would
thus serve as a poor comparison to our model, which attempts to find stylistically
complementary products of different categories regardless of vendor, rather than
similar or identical products. As such we do not use them as a baseline to compare our
model against.

Online Evaluation. We run an A/B test on Overstock¹ with both the visual variant and
multimodal variant assortments on product pages. The users are provided an option to
add an entire assortment or individual items from the recommendations module to their
cart. User engagement with recommendations is measured via click-through rate (CTR) on
assortment recommendations, a classic measure of user engagement with product
recommendations in e-commerce platforms. Our findings show that the multimodal variant
outperforms the visual variant, with a statistically significant CTR lift of 10.9%
relative to the visual-only baseline.

Offline Evaluations. As we are recommending sets of products together, we can’t use
traditional offline evaluations, such as AUC, MAP and Recall. For scoring, we have
taken the assortments and scored them with the average click-based Jaccard coefficient
calculated pairwise over all products contained within the assortment. We make use of
logs of product clicks over the last 2 months to build this score. This tells us how
compatible the various products within an assortment are with one another.
Assortment Recommendations SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA

Figure 6: Assortments from our visual variant (top row, panels a-d) and multimodal variant (bottom row, panels e-h) with the same seeds. The visual
assortments tend to provide a more-of-the-same assortment, while the multimodal variant is able to create a diverse set of products that still forms
a cohesive style. When the visual variant can’t find products similar to either seed, it selects products which are similar to one another. In some
cases it matches products to one seed and ignores the other, which leaves the second seed item looking out of place. The multimodal variant, which
can take advantage of text data as well, is able to find pieces that create a cohesive look incorporating elements of both seeds.
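Assortments such as these are built greedily. As a rough illustration of the
budget-constrained greedy approximation to the 0-1 QKP described earlier (ranking by
compatibility potential per unit price, then pairwise swaps), consider the Python sketch
below. It is a sketch under stated assumptions, not our implementation: all function and
parameter names are hypothetical, the objective is read as the sum of item scores plus
the reciprocal distances over unordered pairs, `dist` is any callable returning a
strictly positive distance (Mahalanobis in our setting), only the upper bound per
vertical is enforced, and seed handling is omitted.

```python
from itertools import combinations


def objective(selected, q, dist):
    """Assumed objective: item scores plus reciprocal pairwise distances."""
    item_part = sum(q[i] for i in selected)
    pair_part = sum(1.0 / dist(i, j) for i, j in combinations(selected, 2))
    return item_part + pair_part


def feasible(selected, price, vertical, budget, max_per_vertical):
    """Check the budget and the per-vertical upper bounds (lower bounds omitted)."""
    if sum(price[i] for i in selected) > budget:
        return False
    counts = {}
    for i in selected:
        counts[vertical[i]] = counts.get(vertical[i], 0) + 1
    return all(n <= max_per_vertical.get(v, 0) for v, n in counts.items())


def greedy_assortment(products, q, price, vertical, dist, budget, max_per_vertical):
    """Ratio-ranked initialization followed by improving single-item swaps."""

    def potential(i):
        # Own score plus reciprocal distances to all other products, per unit price.
        total = q[i] + sum(1.0 / dist(i, j) for j in products if j != i)
        return total / price[i]

    # Initialization: add products in ranked order while constraints still hold.
    selected = []
    for i in sorted(products, key=potential, reverse=True):
        if feasible(selected + [i], price, vertical, budget, max_per_vertical):
            selected.append(i)

    # Swap phase: accept any feasible swap that strictly improves the objective.
    improved = True
    while improved:
        improved = False
        for pos in range(len(selected)):
            for cand in products:
                if cand in selected:
                    continue
                trial = selected[:pos] + [cand] + selected[pos + 1:]
                if not feasible(trial, price, vertical, budget, max_per_vertical):
                    continue
                if objective(trial, q, dist) > objective(selected, q, dist) + 1e-12:
                    selected = trial
                    improved = True
    return selected
```

Because every accepted swap strictly increases the objective over a finite set of
candidate assortments, the loop terminates, but only at a solution that no single swap
can improve; it carries no guarantee of global optimality.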

To calculate the assortment Jaccard, we look at $S$, the set of all user browsing
sessions that included clicks on at least 2 items of living room furniture. A user
browsing session only spans one visit, so we can assume that when items are co-clicked
within the same session, the user still has the same intent, and is implicitly seeing
both products as satisfying this intent. We take all user browsing sessions over a
two-month span for our dataset. We denote the subset of $S$ consisting of sessions that
include a click on product $a$ as $S_a$. We calculate the Jaccard coefficient for all
pairs of products $(a, b)$ within our corpus as follows:

\begin{equation*}
J(a, b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}
\end{equation*}

In contrast to the standard Jaccard formulation, if $|S_a \cup S_b| = 0$, we consider
$J(a, b) = 0$. This modification prevents pairs of products that were never clicked in
any session from being counted as perfectly compatible with each other, since the
standard formulation defaults these cases to 1.
For assortment $A$ we calculate an assortment Jaccard score $J_A$ as a simple average of
the pairwise Jaccard coefficients for all items within the assortment, excluding the
Jaccard coefficient between the two seeds. An example assortment with a good Jaccard
score is shown in Figure 1.
The multimodal variant maintains a much higher average assortment Jaccard than the
visual-only variant (0.0027 versus 0.0015, Table 1). The dramatic difference between
the multimodal variant and the visual variant may reflect an underlying bias in the
click data: the text modality looks at the same data used in site search and navigation
to surface groups of products to users, so co-click data may reflect better
discoverability of items with similar textual features as opposed to similar visual
features.

Table 1: Jaccard Score averages and max values for assortments of both variants.

Modality      Avg      Max
visual        0.0015   0.0220
multimodal    0.0027   0.0362

We also examine the distribution of topics from each assortment. The topic
representation of products in our PolyLDA generated space is more diverse than that of
the LDA. The products in the LDA space usually are strongly attributed to one or two
topics, while the PolyLDA does a better job of representing a product as a mixture of
topics. This affects style as the products within the topics that are learned are very
visually similar, lending themselves to substitutability rather than complementary
behavior.
Figure 7 shows how many topics compose each assortment. The distribution for the visual
variant is heavily skewed, with the majority of assortments only being composed of a
few topics. The multimodal
variant has much more variety, ranging from assortments composed of only a few topics
to assortments in which nearly half the topics are equally likely. This shows us that
the multimodal variant offers more diverse assortments. This is preferred since, as
depicted in Figure 5, the topics themselves often contain substitutable rather than
complementary items. Properly blending the topics together to create a cohesive look
results in better assortments, as depicted in Figure 6.

[Plot: Total Assortments (×10^4) versus Number of Topics in Assortment (0 to 25), with one line each for Visual and Multimodal.]
Figure 7: The chart tabulates the number of topics that non-trivially contribute to an assortment. The vast majority of visual
assortments have 1-4 topics associated with them, as depicted by the peak in the red line. On the other hand, the multimodal
assortments can have anywhere from one to 20 topics associated with them, which allows for more diverse product representation.
This is desirable as the uncovered topics show very similar items, not necessarily complementary ones, as depicted in Figure 5.
A user would likely not want an entire set of bright red furniture. Mixing among the topics to create a complementary set is
more desirable than an entire set of the exact same visual features.

5 CONCLUSION
Online shopping is a visual experience. Visually-aware recommendations are crucial to
online shopping and e-commerce platforms. Yet, systems which rely solely on images
suffer from a lack of diversity. By incorporating both image and text data we are able
to create cohesive styles.
In this paper we introduced a deep visually-aware large-scale assortment recommender
system for Overstock¹. Our assortment recommender system takes advantage of product
images to create visually coherent trends from Overstock products. We introduced two
variants: a visual-only variant and a multimodal variant. Our visual-only variant
creates a bag-of-visual-words representation of product images by thresholding the
activations from specific layers of a pre-trained deep residual neural network,
ResNet-50. It then applies topic modeling (LDA) on product image representations to
create visual trends among Overstock products. We then proposed a greedy approach (with
and without budget constraints) to create assortment recommendations based on seed
items that maximize the visual compatibility of the set. Our multimodal variant takes
advantage of text-based product attributes in addition to image representation. This
variant utilizes Polylingual LDA (PolyLDA) to create trends that are based on two
modalities, images and text.
We have featured multiple assortments generated from both models. We have evaluated our
results through a set of offline validations and an online large-scale A/B test on
Overstock. Our experimental results indicate that incorporating both image and text
data provides a more cohesive visual style than using only images and can enhance user
engagement metrics with the recommendations module. We also show that PolyLDA provides
a meaningful way to simultaneously learn style across text and image data.

REFERENCES
[1] Moran Beladev, Lior Rokach, and Bracha Shapira. 2016. Recommender systems for product bundling. Knowledge-Based Systems 111 (2016), 193–206.
[2] Sonia Bergamaschi, Laura Po, and Serena Sorrentino. 2014. Comparing Topic Models for a Movie Recommendation System. In WEBIST (2). 172–183.
[3] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[4] Tiago Cunha, Carlos Soares, and André CPLF Carvalho. 2017. Metalearning for Context-aware Filtering: Selection of Tensor Factorization Algorithms. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 14–22.
[5] Giorgio Gallo, Peter L Hammer, and Bruno Simeone. 1980. Quadratic knapsack problems. In Combinatorial optimization. Springer, 132–149.
[6] Robert Garfinkel, Ram Gopal, Arvind Tripathi, and Fang Yin. 2006. Design of a shopbot and recommender system for bundle purchases. Decision Support Systems 42, 3 (2006), 1974–1986.
[7] Prem Gopalan, Jake M Hofman, and David M Blei. 2015. Scalable Recommendation with Hierarchical Poisson Factorization. In UAI. 326–335.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[9] Ruining He, Charles Packer, and Julian McAuley. 2016. Learning compatibility across categories for heterogeneous item recommendation. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 937–942.
[10] Wei-Lin Hsiao and Kristen Grauman. 2017. Creating Capsule Wardrobes from Fashion Images. arXiv preprint arXiv:1712.02662 (2017).
[11] Diane J Hu, Rob Hall, and Josh Attenberg. 2014. Style in the long tail: Discovering unique interests with latent variable models in large scale social e-commerce. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1640–1649.
[12] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE, 263–272.
[13] Murium Iqbal, Adair Kovac, and Kamelia Aryafar. 2018. Discovering Style Trends through Deep Visually Aware Latent Item Embeddings. CVPR Workshops (2018).
[14] Qi Liu, Yong Ge, Zhongmou Li, Enhong Chen, and Hui Xiong. 2011. Personalized travel package recommendation. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 407–416.
[15] Si Liu, Jiashi Feng, Zheng Song, Tianzhu Zhang, Hanqing Lu, Changsheng Xu, and Shuicheng Yan. 2012. Hi, magic closet, tell me what to wear! In Proceedings of the 20th ACM international conference on Multimedia. ACM, 619–628.
[16] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[17] David Mimno, Hanna M Wallach, Jason Naradowsky, David A Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 880–889.
[18] Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In Advances in neural information processing systems. 1257–1264.
[19] Ivet Rafegas, Maria Vanrell, and Luís A Alexandre. 2017. Understanding trained CNNs by indexing neuron selectivity. arXiv preprint arXiv:1702.00382 (2017).
[20] Stephen Roller and Sabine Schulte Im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1146–1157.
[21] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in neural information processing systems. 2643–2651.
[22] Robin Van Meteren and Maarten Van Someren. 2000. Using content-based filtering for recommendation. In Proceedings of the Machine Learning in the New Information Age: MLnet/ECML2000 Workshop. 47–56.
[23] Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. 2015. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 4642–4650.
[24] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 1113–1120.
[25] Min Xie, Laks VS Lakshmanan, and Peter T Wood. 2010. Breaking out of the box of recommendations: from items to packages. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 151–158.
[26] Shengli Xie and Yifan Feng. 2015. A recommendation system combining LDA and collaborative filtering method for Scenic Spot. In Information Science and Control Engineering (ICISCE), 2015 2nd International Conference on. IEEE, 67–71.
[27] Tao Zhu, Patrick Harrington, Junjun Li, and Lei Tang. 2014. Bundle recommendation in ecommerce. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 657–666.