=Paper= {{Paper |id=Vol-2319/paper18 |storemode=property |title=A Multimodal Recommender System for Large-scale Assortment Generation in e-Commerce |pdfUrl=https://ceur-ws.org/Vol-2319/paper18.pdf |volume=Vol-2319 |authors=Murium Iqbal,Adair Kovac,Kamelia Aryafar |dblpUrl=https://dblp.org/rec/conf/sigir/IqbalKA18 }} ==A Multimodal Recommender System for Large-scale Assortment Generation in e-Commerce== https://ceur-ws.org/Vol-2319/paper18.pdf
                A Multimodal Recommender System for Large-scale
                      Assortment Generation in E-commerce
                       Murium Iqbal                                                           Adair Kovac                                       Kamelia Aryafar
                       Overstock                                                            Overstock                                              Overstock
                     Midvale, Utah                                                        Midvale, Utah                                          Midvale, Utah
                 miqbal@Overstock.com                                                 akovac@Overstock.com                                  karyafar@Overstock.com
ABSTRACT
E-commerce platforms surface interesting products largely through product recommendations that capture users’ styles and aesthetic preferences. Curating recommendations as a complete complementary set, or assortment, is critical for a successful e-commerce experience, especially for product categories such as furniture, where items are selected together with the overall theme, style or ambiance of a space in mind. In this paper, we propose two visually-aware recommender systems that can automatically curate an assortment of living room furniture around a couple of pre-selected seed pieces for the room. The first system aims to maximize the visual-based style compatibility of the entire selection by making use of transfer learning and topic modeling. The second system extends the first by incorporating text data and applying polylingual topic modeling to infer style over both modalities. We review the production pipeline for surfacing these visually-aware recommender systems and compare them through offline validations and large-scale online A/B tests on Overstock¹. Our experimental results show that complementary style is best discovered over product sets when both visual and textual data are incorporated.

CCS CONCEPTS
• Information systems → Recommender systems; Collaborative filtering; Presentation of retrieval results; • Computing methodologies → Neural networks;

KEYWORDS
recommender system, visual document, topic modeling, set recommendation, quadratic knapsack problem, product recommendation

ACM Reference format:
Murium Iqbal, Adair Kovac, and Kamelia Aryafar. 2018. A Multimodal Recommender System for Large-scale Assortment Generation in E-commerce. In Proceedings of ACM SIGIR Workshop on eCommerce, Ann Arbor, Michigan, USA, July 2018 (SIGIR 2018 eCom), 9 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn

Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGIR 2018 eCom, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s). 978-x-xxxx-xxxx-x/YY/MM. . . $15.00

Figure 1: An automatically generated assortment from the multimodal approach is shown.

1 INTRODUCTION
Overstock¹ is an e-commerce platform with the goal of creating dream homes for all. Users browse Overstock’s catalog to select pieces that complement one another while matching the stylistic settings and color palettes of their rooms. As furniture is not a disposable product, furniture purchases are subject to careful scrutiny of aesthetics and strict budgeting. Brick-and-mortar stores often inspire consumers by creating furniture showrooms, in some cases pushing consumers to walk through their carefully selected furniture displays. These assorted showrooms alleviate the creative and stylistic pressure on the consumer, since the set is already on display with all the necessary pieces.

By crafting an appropriate recommender system, this experience can be recreated on an e-commerce platform. Figure 1 illustrates an example of the proposed system’s automatically generated showroom, built around two seed products selected from Overstock¹. The goal of this recommender system is to provide set recommendations that adhere to a general theme or cohesive visual style while accounting for essential item constraints². Set recommendations have become more prominent with the rise of subscription box services in various domains such as fashion (e.g. StitchFix³), jewelry (e.g. Rocksbox⁴) and beauty products (e.g. Birchbox⁵). The criteria for selecting an assortment depend heavily on the product space, which can inform the latent space of user preferences and product

¹ www.overstock.com
² Essential item constraints refer to specific products which must be included in an assortment, e.g. a bed frame in a bedroom set or a vendor must-have in a subscription box.
³ www.stitchfix.com
⁴ www.rocksbox.com
⁵ www.birchbox.com
SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA                                        Murium Iqbal, Adair Kovac, and Kamelia Aryafar


representations. Personal care product recommendations, for example, rely on user preferences, such as skin type, which can be inferred from product text-based attributes. Fashion, jewelry and furniture shopping, on the other hand, are predominantly visual experiences. This motivates a visually-aware representation for products and style-based preferences for users.

We propose two assortment recommender systems. The first takes advantage of visually semantic features transferred from a deep convolutional neural network to learn style. The second uses both these visual features and product text-based attributes to learn style across the multimodal dataset. Our hypothesis is that while the visual style can help find similar products, use of text data in conjunction with images will result in more complementary and cohesive stylized assortments.

Figure 2: A system overview diagram is presented above. We use product data and user engagement data residing within our Hadoop cluster and our NFS/S3 store to train the model. Most of the training is done on a one-box GPU server, which runs the Resnet-50, Mallet and our greedy algorithm. All results are served to Overstock¹ users from Cassandra.

2 RELATED WORK
As the prevalence of recommendation systems grows, e-commerce platforms are looking to increase their influence on customer purchases by recommending not just one item at a time but several together in a bundle [25] [27] [1] [6]. This is true across industries, from travel agencies to clothing [10] [14] [15]. Specifically, Zhu et al. [27] model the problem of bundle recommendations as the Quadratic Knapsack Problem (QKP), and find an approximation to it. They use only implicit feedback data, rather than content data, to build their system, allowing for the bundles to be personalized. Following this approach, we also view our system as a QKP, but look to leverage primarily content data, incorporating implicit feedback as a minor signal. To capture a user’s preference, the system we propose allows a user to present a seed product, around which we can create the bundle. This allows us to circumvent the cold start problem, as any user, regardless of whether they have a history on our site, can still build assortment recommendations. New products are also seamlessly incorporated, as long as images or text data are made available.

Recommender systems are a well-studied field with two large overarching types: collaborative filtering methods [18] [12] [7] and content-based methods [4] [21] [22], many of which rely on learning latent representations of the input data. Topic modeling, specifically Latent Dirichlet Allocation (LDA) [3], has also been applied to create topic-based recommendations both from text input and from implicit feedback [11] [2] [26]. These systems work by scoring individual products against one another or against users to see which are most similar. Here we look to extend this methodology by applying topic modeling over both image and text data. Rather than simply concatenating image and text features and linearly combining them, Polylingual LDA (PolyLDA) enables us to learn two distinct but coupled latent style representations. This allows for a versatile interpretation of style [13]. We then use these learned styles to make bundle or assortment recommendations rather than the traditional single item recommendations.

To facilitate style discovery from the images, we process them via prevalent deep learning techniques. Deep residual networks have recently been shown to be a powerful model for capturing style-based preferences when creating visual wardrobes from fashion images [10]. With the goal of learning visually coherent styles, we apply transfer learning by using a Resnet-50⁶ which was pre-trained on ImageNet⁷ [8]. We explore interpreting the convolutional neural network by indexing the activations on channels within the convolutional layers [19]. We use a neural network to learn filters for our data and simply index their responses to images to create visual documents, similar to the older bag-of-words methods [24]. To discover visually-aware trends, we use these documents with Latent Dirichlet Allocation (LDA) [3].

⁶ https://github.com/facebook/fb.resnet.torch
⁷ http://www.image-net.org/

While we use LDA for topic modeling on single-modality image data, we need an extension to interpret both visual semantic features and text-based attribute data as complementary modalities. Roller et al. [20] create a multimodal topic model that assumes words and corresponding visual features occur in pairs and should be captured as tuples within the topic distribution. This model would work well for descriptions of images, but the underlying assumption could prove to be too stringent to generalize to our application. Mimno et al. [17] offer a more flexible extension of LDA, Polylingual LDA (PolyLDA), which handles tuples of documents in different languages with assumed identical topic distributions. The documents within a tuple do not need to be direct translations of one another, and the topics themselves have distinct sets of words for each language. By treating our image data and our text data as two separate languages, we can use PolyLDA directly to create a more flexible multimodal topic model with which to infer style.

Hsiao and Grauman [10] also apply PolyLDA to infer style from images, but they use complementary clothing types (e.g. pants and blouse) as languages in a compatibility model instead of using PolyLDA to handle different modalities of data for the same item. The documents they use for their topic models are generated using a Resnet trained to automatically detect their pre-defined text attributes within images and apply the appropriate labels.

Much of the work performed on traditional recommendation systems can be applied here, but additional constraints must be taken into consideration. Products can’t simply be visually similar to be purchased together, but must be complementary as well. For example, having two similar brightly colored upholstered chairs as a result in a traditional recommendation is perfectly acceptable, but if they
are both taken into one assortment with the intention of being bought together, they may clash. Here we show that by incorporating text data, we are able to avoid assortments of extremely visually similar, interchangeable products which are not at all complementary. These nuances can be further addressed by incorporating purchase data to infer what types of products are not just similar, but also compatible, and likely to be bought together. To create the final assortment recommendations, we define a distance measure that combines stylistic similarity with implicit feedback from users, in the form of purchases, to develop a concept of affinity similar to approaches in composite recommendation systems such as [9, 10, 23, 27].

The rest of this paper is organized as follows: In Section 3, we present our production pipeline for assortment recommendations as deployed at Overstock¹ and explore our product embeddings as used with topic modeling techniques. We also review our formulation of assortment generation as a 0-1 quadratic knapsack problem and our greedy optimization for building assortments. Section 4 presents our offline and online experimental results. Finally, we conclude our paper in Section 5 and propose future directions.

3 METHODOLOGY
Here we provide an overview of the production system used to surface recommendations throughout Overstock¹ and describe our assortment recommender system in detail. We first explain our product embeddings as text-based bag-of-words and bag-of-visual-words (BoVW) documents. Then we explain the LDA approaches used to discover text-based and visual styles across our platform. We finally discuss our formulation of assortment generation as a 0-1 quadratic knapsack problem with budget constraints and our greedy assortment recommendations algorithm.

3.1 System Overview
Product recommendations on Overstock¹ are surfaced through multiple carousels across the platform. User interactions with products and recommendations are recorded alongside purchase data. Each product on the platform is represented by its title, text attribute tags, text-based descriptions and images. The first step in generating recommendations is processing the large volume of product and user data, which is hosted on Hadoop servers.

After initial processing on Hadoop, the data is transferred to a one-box CUDA-enabled GPU server. Here product images are passed through a publicly available Resnet-50 to create image-based documents. Both the image and text documents are then fed into Mallet⁸ to generate topic representations of products. We create two topic variants: a visual variant that uses the image data with LDA, and a multimodal variant that uses both text and image data with PolyLDA.

⁸ http://mallet.cs.umass.edu/

Once the topic distributions are created, the products reside within their defined feature space. Hereafter, the methodology for both variants is identical. We construct a distance measure over the topic space, using implicit feedback from our users, to model compatibility. The assortments are then generated by finding products which minimize this distance to predefined seed product pairs.

The generated assortments are pushed to our Cassandra database cluster. Our website servers can query Cassandra directly to see which products have an associated assortment to present to users. Assortment recommendations are featured on corresponding product pages, below the product description, providing an option for users to purchase an entire assortment, or just parts of it. A brief diagram of the systems involved is presented in Figure 2.

3.2 Product Embeddings
Products on Overstock¹ have several images of the item for sale and different forms of associated text data including title, description and attribute tags. Product attributes are descriptive tags associated with a product. Some example attribute categories are color, size, style, composing materials, finish, and brand. In this paper, we utilize these attributes as the primary text-based information since they often provide a rich text representation of the item that complements the provided images. All product attribute data is processed to remove stop words and is concatenated with each product title to form a bag-of-words document.

We use the methodology described in [13] to create visual documents, via a Resnet, to be used with LDA. We then combine visual and text data via PolyLDA.

The convolutional layers within a Resnet are composed of a series of small learned filters. Each filter is convolved with the input in steps along its height and width. The result of this process is a 2-dimensional grid of activations that represents the response from the filter at corresponding locations of the input. The learned filters within a Resnet respond to specific patterns. By viewing these filters as words, and indexing them, Iqbal et al. [13] show that a BoVW document can be created which, when combined with LDA, is effective at uncovering style. To index a layer, we simply threshold the response from the filter to indicate whether or not it is sufficiently activated by the image to be included within the image’s document.

Figure 3: Here we depict activations from the layers used to create our visual documents. The leftmost column is the raw image. The next column to the right is the cropped patch which is passed through the network. In the next 4 columns, we see the response to these patches from 9 channels each of convolutional layers 8, 18, 31 and 43. Lighter colors mean strong responses, while darker colors mean weak responses. We threshold these responses to determine whether or not the "word" each channel represents is present within the image. This methodology seems to work particularly well for our upholstered furniture, showing distinct responses to the different patterns.
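The thresholding step described for visual documents can be illustrated with a minimal sketch. It assumes that each image's activations are already available as NumPy arrays of shape (channels, height, width), one per indexed layer. The function names, the word naming scheme, the threshold value, and the choice to summarize each channel's 2-dimensional grid by its maximum response are all assumptions made for illustration; the text above only states that filter responses are thresholded.

```python
import numpy as np

def visual_words(layer_activations, threshold=0.5):
    """Return the set of visual words present in one image.

    layer_activations maps a layer index to an array of shape
    (channels, height, width) holding that layer's activations.
    """
    words = set()
    for layer, acts in layer_activations.items():
        # Assumption: a channel counts as "activated" when the peak of its
        # 2-D response grid exceeds the threshold.
        peak = acts.reshape(acts.shape[0], -1).max(axis=1)
        for channel in np.flatnonzero(peak > threshold):
            words.add(f"layer{layer}_ch{channel}")
    return words

def product_document(images, threshold=0.5):
    """Union of the visual words over all images of one product."""
    doc = set()
    for layer_activations in images:
        doc |= visual_words(layer_activations, threshold)
    return sorted(doc)

# Toy activations: two images of the same product, tiny hand-made grids.
img_a = {8: np.array([[[0.1, 0.9]], [[0.2, 0.3]]])}
img_b = {8: np.array([[[0.0, 0.1]], [[0.7, 0.2]]]),
         18: np.array([[[0.6]]])}
doc = product_document([img_a, img_b])
print(doc)
```

Each resulting word identifies one channel of one layer, so the vocabulary size is simply the total number of channels across the indexed layers.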
Given this methodology for creating image documents, our visual word vocabulary is defined entirely by the channels in the layers that we choose to index. We tested creating documents using various layers through the network. We found that combining several middle layers gave us the best results. After experimenting, we chose to use Resnet-50 layers 8, 18, 31, and 43. The combination of these layers provided better results, as judged by visual inspection, than any single layer or other combinations. This yielded a total vocabulary size of 2816 visual words over our image documents. Since multiple images can be provided for each product, we take the union of all present visual words within the corresponding images as the visual document for the product itself. The process is depicted in Figure 4.

Figure 4: Each product’s corresponding text and images are collected. Our system then discards images with noisy scenes, keeping only those with white backgrounds and passing them through Resnet-50. Channel activations from layers 8, 18, 31, and 43 are thresholded and indexed to create visual documents. The union of these documents is taken as the visual representation of the product. The text data is stripped of stopwords and paired with the corresponding visual document to create a tuple for ingestion by PolyLDA.

3.3 Topic Modeling
Once the visual documents are created, they can be used directly with LDA to create topics for our visual variant. LDA is an unsupervised generative model which infers probability distributions of words within topics and models documents as a mixture of those topics [3]. Given a set of documents and a number of target topics, the model assumes the following generative process was used to create the documents.
      (1) For $M$ documents, initialize the set of topic distributions, $\theta_i \sim \mathrm{Dir}(\alpha)$ for $i = 1, \dots, M$
      (2) For $K$ topics, initialize the set of word distributions, $\phi_k \sim \mathrm{Dir}(\beta)$ for $k = 1, \dots, K$
      (3) For the $n$-th word in the $i$-th document, select a topic $z_{i,n}$ from $\theta_i$ and a word $w_{i,n}$ from $\phi_{z_{i,n}}$
where Dir is the Dirichlet distribution.

The model initializes the parameters randomly, then iteratively updates these parameters via an inference procedure such as Gibbs sampling or variational Bayes. After many iterations the topic distributions converge to a stable state and the resulting topics can be used as a low dimensional feature space that captures the salient content of the original documents.

For the multimodal variant, each visual document is paired with its text equivalent in a tuple. These tuples are then used with Polylingual LDA (PolyLDA) [17]. Products that are missing either a text document or a visual document can still be used with PolyLDA, as it is robust to missing documents within tuples. A diagram of the system for multimodal document creation is provided in Figure 4. PolyLDA is an extension of LDA meant to handle loosely equivalent documents in different languages. It takes as input tuples of documents, each in a different language but with the same exact topic distribution. Each resulting latent topic from this method contains a distinct word distribution for each language. Documents aren’t required to be direct translations of one another, which allows for flexibility. A word in one language could be attributed to a given topic while its direct translation in another language may be attributed to a different topic. Words that only appear in one language can also be present in topics despite having no equivalent translations in the corpus. PolyLDA extends LDA by assuming the following generative procedure:
      (1) For each tuple $i$ of documents, $\{d_1, \dots, d_L\}$, initialize a single topic distribution, $\theta_i \sim \mathrm{Dir}(\alpha)$
      (2) For $K$ topic sets with $L$ languages, initialize the set of word distributions, $\phi_{k,l} \sim \mathrm{Dir}(\beta)$ for $k = 1, \dots, K$ and $l = 1, \dots, L$
      (3) For the $n$-th word in the $l$-th document in the $i$-th tuple, select a topic $z_{i,l,n}$ from $\theta_i$ and a word $w_{i,l,n}$ from $\phi_{z_{i,l,n},\,l}$

For our multimodal variant, we make use of the flexibility afforded by PolyLDA by applying it to our visual and text documents, representing each modality as a distinct language attempting to describe the same product. This method should allow for topics that afford better generality. Intuitively for our corpus, we suspect that a topic could capture relationships between certain visual features, like thick wood, and textual attributes, like "quality", that don’t have a direct visual representation but do share an underlying contextual relationship. By combining modalities the model also affords us the ability to infer a more complete style representation given incomplete data. For example, we can apply our multimodal topics to products with only text attributes available and still infer style that would normally be captured by the missing image data.

We provide visualizations of the resulting topics from both variants in Figure 5. To create these visualizations we selected the products that are maximally aligned with each topic. In our case, we have empirically seen that the topics relate well to various styles while reducing our feature space from a large number of attribute tags, title words, and image-based channel activations to a succinct K-dimensional space (where K is the number of topics). These styles are not necessarily complementary though, especially in the case of the visual topics. We can see that the items within the topics seem more substitutable, e.g. in the cases of brightly colored furniture, similarly patterned upholstery, and mirrored surfaces.

3.4 Knapsack Problem
Once we have all our products residing within the topics’ feature space, we can begin to build out our assortments. The process for generating the assortments is the same for both the visual variant and the multimodal variant, with the only difference between the two
Assortment Recommendations                                                          SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA




Figure 5: Representative items from each of six topics in the visual variant on the left (a) and multimodal variant on the right (b) are depicted above.


being the topic representations of products. To build these assortments, we first define the necessary components of an assortment as verticals. The verticals we define here enable us to build living room assortments, but we would like to emphasize that verticals can be defined for any e-commerce platform. If desired, the verticals can be left undefined, and the assortment can be built with no vertical constraints. For our specific application, verticals are critical pieces of living room furniture, e.g. coffee table, chair, or accent table. Our definitions are manually applied to the products and are listed below. (Here a couch set can be either a sectional sofa or a sofa and loveseat combination.)

  Verticals: {Couch Set, Coffee Table, Accent Table, Entertainment Center, Bookshelf, Ottoman, Chair}

   We chose to build our assortments around seed pieces. For our current application, we generated seeds, but we can easily allow users to provide their own seeds around which we can curate an assortment. We selected seeds as preferred pairings of the most crucial verticals in any living room space, namely a couch set and a coffee table. Although all the other verticals can be seen as optional, these two verticals are defining members of a living room and therefore must be present in any assortment. For our seeds, we chose the pairs of sofas and coffee tables that are most frequently co-clicked.
   We then assume a budget constraint for the entire assortment. As such we can formulate our assortment generation as a 0-1 quadratic knapsack problem (0-1 QKP) as defined by Gallo et al. [5]. The 0-1 QKP assumes that a knapsack with limited space needs to be filled with a set of items. These items all have an intrinsic value. Items also have an additional associated pair-wise profit when selected together for the knapsack.
   For a given seed, we can find the optimal assortments by trying to find the products that are nearest to the seed in the topic space. We can add the additional constraint that we want all products within the assortment to be near one another in the topic space as well. The total assortment must remain within the assumed budget constraint. Each product has an associated vertical, and we can add constraints on the minimum and maximum number of products each vertical can contribute to the final assortment.
   To calculate proximity we use a Mahalanobis distance built on purchase data, similar to [16]. The Mahalanobis distance allows us to leverage the implicit feedback of our users to understand how our topics distribute and relate to one another across multiple products within a purchase.
   We assume that purchases of living room furniture made by the same user over a 3-month period are being used in the same room, and as such can be said to represent a compatible assortment. After trimming to include only purchases with 3-10 pieces of living room furniture, so as to avoid noise and erroneous signals from bulk purchasers, we create a dataset of roughly 7.5k purchased assortments. We take the L2-normalized sum of the topic distributions of the co-purchased products as a topic representation of the purchase itself, and use this dataset to calculate the covariance matrix used for our Mahalanobis distance.
   The Mahalanobis distance is defined as $d_M(x_i, x_j) = (x_i - x_j)\, M\, (x_i - x_j)^T$, where M is the covariance matrix for the topics learned from our purchase data. For each seed pair of coffee table and couch set, we greedily swap products in and out of our assortment until we
converge to an ideal set of products. The only verticals which are not allowed to change are those in the seed.
   To formulate the 0-1 QKP we first define some terminology. Each product o will have a vertical label $a_o$, an associated price $c_o$, and an associated topic distribution $t_o$. Every pair of products has an associated pairwise distance $d_{i,j}$, which is the Mahalanobis distance between products $o_i$ and $o_j$ built on the purchases and topic distributions. Additionally, each product has a score $q_o$ that represents its value to the seed; here we use the inverse of the distance from the product to the seed. Let M be the set of products comprising the assortment. Let B be the total available budget minus the cost of the seed. The 0-1 knapsack problem can then be formulated as:

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{M} \Big( q_i + \sum_{j=1,\, j \neq i}^{M} \frac{1}{d_{i,j}} \Big) \\
\text{subject to} \quad & \sum_{i=1}^{M} c_i \leq B, \\
& \forall k:\ \min\nolimits_k \leq \operatorname{count}(a_o = k) \leq \max\nolimits_k
\end{aligned}
\tag{1}
\]

   A greedy approximation can now be formulated to follow the constraints of the 0-1 QKP. First a solution is initialized, then items are iteratively swapped until the system converges to a locally optimal solution. To get the initial set, each product is scored by its total potential for compatibility (its score plus the sum of the reciprocals of its pairwise Mahalanobis distances) divided by its price. Products are then sorted by this ratio and the highest scored are added until vertical constraints or budget constraints are met. Then each product is considered for a swap with other products to improve the total compatibility of the assortment.

3.5    Greedy Method
For our online evaluation, we relax the budget constraints, so that we can display the assortment on product pages. This allows us to select the best assortment per product. We can easily reapply the budget constraint for user assortments. As such, we use the following greedy method to generate the assortments. The assortment is initialized with only the seed items; all other verticals are empty. Each vertical is then iteratively considered for new products while holding all other verticals constant. New products are chosen from the pool of products that are labeled as candidates for this vertical. Those which are closest to the total assortment (minus the current vertical) at the previous timestep are added in, until the size of the vertical is met. The sizes of the verticals in our experiment were set by us, but a user can dictate how many side tables, bookshelves, chairs, etc. they wish to add to their assortment.
   Let V be the set of all verticals. S is the set of verticals used in the seed. $A_i$ is the set of products selected for the assortment in vertical i for the current timestep. Each vertical i has an associated size, $i_{size}$, which is the number of products we want for this vertical. $A_{i,prev}$ is the set of products selected for the assortment in vertical i for the previous timestep. $p_i$ is an element of $A_{i,prev}$. $o_i$ is a candidate product for vertical i. $t_o$ is the topic distribution of a product. $d_M$ is the Mahalanobis distance as defined above. $\delta$ is the change in the distance of the assortment from iteration to iteration. The greedy approach is described using this notation in Algorithm 1. Figure 6 illustrates the generated assortments using this method for both the visual and multimodal variants.

Algorithm 1 Generate Assortments
    $\delta \leftarrow \epsilon + 1$
    $A_{i,prev} \leftarrow \{\}\ \ \forall i \in \{V - S\}$
    $A_{i,prev} \leftarrow \{p_i\}\ \ \forall i \in \{S\}$
    while $\delta \geq \epsilon$ do
        $\delta \leftarrow 0$
        for $i \in \{V - S\}$ do
            $t_A \leftarrow \lVert \sum t_{p_j} \rVert\ \ \forall p_j \in A_{j,prev},\ \forall j \in V,\ j \neq i$
            while size($A_i$) < $i_{size}$ do
                $A_i \leftarrow A_i \cup \operatorname{argmin}_o\, d_M(t_o, t_A)$
            $\delta \leftarrow \delta + d_M(t_{A_i}, t_{A_{i,prev}})$

4    RESULTS AND DISCUSSION
This section describes a large-scale experiment on Overstock¹ to determine whether simultaneously learning style from text and images provides better results than learning style from only images. All offline validations and the online A/B test assume two variants: a visual-only assortment recommender system, which is built on top of a BoVW representation of product images, and a multimodal one, which is built on text-based attributes and visual words. Our site does offer manually curated collections, but there is little overlap between the products selected in the manual process and those selected by our automated system. The methodology used to create these collections also involves selecting products from the same vendor and the same product line, which can often be viewed as substitutable products as well. These would thus serve as a poor comparison to our model, which attempts to find stylistically complementary products of different categories regardless of vendor, rather than similar or identical products. As such we do not use them as a baseline to compare our model against.

   Online Evaluation. We run an A/B test on Overstock¹ with both the visual variant and multimodal variant assortments on product pages. The users are provided an option to add an entire assortment or individual items from the recommendations module to their cart. User engagement with recommendations is measured via click-through rate (CTR) on assortment recommendations, a classic measure of user engagement with product recommendations in e-commerce platforms. Our findings show that the multimodal variant outperforms the visual variant, showing a statistically significant lift of 10.9% relative to the visual-only baseline.

   Offline Evaluations. As we are recommending sets of products together, we can't use traditional offline evaluations, such as AUC, MAP and Recall. For scoring, we have taken the assortments and scored them with the average click-based Jaccard coefficient calculated pairwise over all products contained within the assortment. We make use of logs of product clicks over the last 2 months to build this score. This tells us how compatible the various products within an assortment are with one another.
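   As a rough illustration of this scoring, the sketch below computes a click-based Jaccard coefficient for a pair of products and averages it over an assortment. It is a minimal sketch under stated assumptions, not our production implementation: the session format (a list of sets of clicked product ids), the function names, and the toy data are all hypothetical. Pairs whose union of sessions is empty are scored 0, and the pair formed by the two seeds is left out of the average.

```python
from itertools import combinations


def sessions_by_product(sessions):
    """Map each product id to the set of session indices in which it was clicked."""
    index = {}
    for sid, clicked in enumerate(sessions):
        for product in clicked:
            index.setdefault(product, set()).add(sid)
    return index


def jaccard(index, a, b):
    """Click-based Jaccard coefficient J(a, b); defined as 0 when the union is empty."""
    s_a, s_b = index.get(a, set()), index.get(b, set())
    union = s_a | s_b
    if not union:
        return 0.0
    return len(s_a & s_b) / len(union)


def assortment_jaccard(index, assortment, seeds=()):
    """Average pairwise Jaccard over an assortment, skipping the seed pair."""
    seed_pair = frozenset(seeds)
    scores = [
        jaccard(index, a, b)
        for a, b in combinations(assortment, 2)
        if frozenset((a, b)) != seed_pair
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: each set holds the products clicked in one session.
sessions = [
    {"sofa", "table"},
    {"sofa", "chair"},
    {"sofa", "table", "chair"},
    {"ottoman", "chair"},
]
index = sessions_by_product(sessions)
score = assortment_jaccard(index, ["sofa", "table", "chair"], seeds=("sofa", "table"))
```

   Building the product-to-sessions index once keeps each pairwise lookup to two set operations, which matters when the score is computed over many assortments.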




[Figure 6 image panels: (a)-(d) top row, (e)-(h) bottom row]


Figure 6: Assortments from our visual variant (top) and multimodal variant (bottom) with the same seeds are depicted above. The visual assortments
tend to provide a more-of-the-same assortment, while the multimodal is able to create a diverse set of products that still forms a cohesive style. When
the visual assortment can’t find products similar to either seed, it selects products which are similar to one another. In some cases it can match
products to one seed, and ignores the other, which leaves the second seed item looking out of place. The multimodal variant, which can take advantage
of text data as well, is able to find pieces to create a cohesive look incorporating elements of both seeds.


   To calculate the assortment Jaccard, we look at S, the set of all user browsing sessions that included clicks on at least 2 items of living room furniture. A user browsing session only spans one visit, so we can assume that when items are co-clicked within the same session, the user still has the same intent, and is implicitly seeing both products as satisfying this intent. We take all user browsing sessions over a two-month span for our dataset. We denote the subset of S that includes sessions that have clicked on product a as $S_a$.
   We calculate the Jaccard coefficient for all pairs of products (a, b) within our corpus as follows:

\[
J(a, b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}
\]

In contrast to the standard Jaccard formulation, if $|S_a \cup S_b| = 0$, we consider $J(a, b) = 0$. This modification prevents products that were never clicked in any session from being counted as perfectly compatible with each other, since the standard formulation defaults these cases to 1.
   For assortment A we calculate an assortment Jaccard score $J_A$ as a simple average of the pairwise Jaccard coefficient for all items within the assortment, excluding the Jaccard coefficient between the two seeds. An example assortment with a good Jaccard score is shown in Figure 1.
   The multimodal variant maintains a much higher average assortment Jaccard than the visual-only variant (Table 1). The dramatic difference between the multimodal variant and the visual variant may reflect an underlying bias in the click data: the text modality looks at the same data used in site search and navigation to surface groups of products to users, so co-click data may reflect better discoverability of items with similar textual features as opposed to similar visual features.

Table 1: Jaccard Score averages and max values for assortments of both variants.

   Modality       Avg        Max
   visual         0.0015     0.0220
   multimodal     0.0027     0.0362

   We also examine the distribution of topics from each assortment. The topic representation of products in our PolyLDA-generated space is more diverse than that of the LDA. The products in the LDA space usually are strongly attributed to one or two topics, while the PolyLDA does a better job of representing a product as a mixture of topics. This affects style, as the products within the topics that are learned are very visually similar, lending themselves to substitutability rather than complementary behavior.
   Figure 7 shows how many topics compose each assortment. The distribution for the visual variant is heavily skewed, with the majority of assortments only being composed of a few topics. The multimodal
variant has much more variety: assortments composed of only a few topics and assortments composed of nearly half the topics are about equally likely. This shows us that the multimodal variant offers more diverse assortments. This is preferred since, as depicted in Figure 5, the topics themselves often contain substitutable rather than complementary items. Properly blending the topics together to create a cohesive look results in better assortments, as depicted in Figure 6.

[Figure 7 plot: x-axis "Number of Topics in Assortment" (0 to 25); y-axis "Total Assortments" (0 to 1.5 ·10^4); series: Visual, Multimodal]

Figure 7: The above chart tabulates the number of topics that non-trivially contribute to an assortment. The vast majority of visual assortments have 1-4 topics associated with them, as depicted by the peak in the red line. On the other hand, the multimodal assortments can have anywhere from one to 20 topics associated with them, which allows for more diverse product representation. This is desirable as the uncovered topics show very similar items, not necessarily complementary ones, as depicted in Figure 5. A user would likely not want an entire set of bright red furniture. Mixing among the topics to create a complementary set is more desirable than an entire set of the exact same visual features.

5    CONCLUSION
Online shopping is a visual experience. Visually-aware recommendations are crucial to online shopping and e-commerce platforms. Yet, systems which rely solely on images suffer from a lack of diversity. By incorporating both image and text data we are able to create cohesive styles.
   In this paper we introduced a deep visually-aware large-scale assortment recommender system for Overstock¹. Our assortment recommender system takes advantage of product images to create visually coherent trends from Overstock products. We introduced two variants: a visual-only variant and a multimodal variant. Our visual-only variant creates a bag-of-visual-words representation of product images by thresholding the activations from specific layers of a pre-trained deep residual neural network, ResNet-50. It then applies topic modeling (LDA) on product image representations to create visual trends among Overstock products. We then proposed a greedy approach (with and without budget constraints) to create assortment recommendations based on seed items that maximize the visual compatibility of the set. Our multimodal variant takes advantage of text-based product attributes in addition to image representation. This variant utilizes Polylingual LDA (PolyLDA) to create trends that are based on two modalities, images and text.
   We have featured multiple assortments generated from both models. We have evaluated our results through a set of offline validations and an online large-scale A/B test on Overstock. Our experimental results indicate that incorporating both image and text data provides a more cohesive visual style than using only images and can enhance user engagement metrics with the recommendations module. We also show that PolyLDA provides a meaningful way to simultaneously learn style across text and image data.

REFERENCES
[1] Moran Beladev, Lior Rokach, and Bracha Shapira. 2016. Recommender systems for product bundling. Knowledge-Based Systems 111 (2016), 193–206.
[2] Sonia Bergamaschi, Laura Po, and Serena Sorrentino. 2014. Comparing Topic Models for a Movie Recommendation System. In WEBIST (2). 172–183.
[3] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
[4] Tiago Cunha, Carlos Soares, and André CPLF Carvalho. 2017. Metalearning for Context-aware Filtering: Selection of Tensor Factorization Algorithms. In Proceedings of the Eleventh ACM Conference on Recommender Systems. ACM, 14–22.
[5] Giorgio Gallo, Peter L Hammer, and Bruno Simeone. 1980. Quadratic knapsack problems. In Combinatorial optimization. Springer, 132–149.
[6] Robert Garfinkel, Ram Gopal, Arvind Tripathi, and Fang Yin. 2006. Design of a shopbot and recommender system for bundle purchases. Decision Support Systems 42, 3 (2006), 1974–1986.
[7] Prem Gopalan, Jake M Hofman, and David M Blei. 2015. Scalable Recommendation with Hierarchical Poisson Factorization. In UAI. 326–335.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[9] Ruining He, Charles Packer, and Julian McAuley. 2016. Learning compatibility across categories for heterogeneous item recommendation. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 937–942.
[10] Wei-Lin Hsiao and Kristen Grauman. 2017. Creating Capsule Wardrobes from Fashion Images. arXiv preprint arXiv:1712.02662 (2017).
[11] Diane J Hu, Rob Hall, and Josh Attenberg. 2014. Style in the long tail: Discovering unique interests with latent variable models in large scale social e-commerce. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1640–1649.
[12] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 263–272.
[13] Murium Iqbal, Adair Kovac, and Kamelia Aryafar. 2018. Discovering Style Trends through Deep Visually Aware Latent Item Embeddings. CVPR Workshops (2018).
[14] Qi Liu, Yong Ge, Zhongmou Li, Enhong Chen, and Hui Xiong. 2011. Personalized travel package recommendation. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 407–416.
[15] Si Liu, Jiashi Feng, Zheng Song, Tianzhu Zhang, Hanqing Lu, Changsheng Xu, and Shuicheng Yan. 2012. Hi, magic closet, tell me what to wear! In Proceedings of the 20th ACM international conference on Multimedia. ACM, 619–628.
[16] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[17] David Mimno, Hanna M Wallach, Jason Naradowsky, David A Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2. Association for Computational Linguistics, 880–889.
[18] Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In Advances in neural information processing systems. 1257–1264.
[19] Ivet Rafegas, Maria Vanrell, and Luís A Alexandre. 2017. Understanding trained CNNs by indexing neuron selectivity. arXiv preprint arXiv:1702.00382 (2017).
[20] Stephen Roller and Sabine Schulte Im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1146–1157.
[21] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Advances in neural information processing systems. 2643–2651.
[22] Robin Van Meteren and Maarten Van Someren. 2000. Using content-based filtering for recommendation. In Proceedings of the Machine Learning in the New Information Age: MLnet/ECML2000 Workshop. 47–56.
[23] Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. 2015. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 4642–4650.
[24] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 1113–1120.
[25] Min Xie, Laks VS Lakshmanan, and Peter T Wood. 2010. Breaking out of the box of recommendations: from items to packages. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 151–158.
[26] Shengli Xie and Yifan Feng. 2015. A recommendation system combining LDA and collaborative filtering method for Scenic Spot. In Information Science and Control Engineering (ICISCE), 2015 2nd International Conference on. IEEE, 67–71.
[27] Tao Zhu, Patrick Harrington, Junjun Li, and Lei Tang. 2014. Bundle recommendation in ecommerce. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. ACM, 657–666.