<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1145/nnnnnnn.nnnnnnn</article-id>
      <title-group>
        <article-title>A Multimodal Recommender System for Large-scale Assortment Generation in E-commerce</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Murium Iqbal</string-name>
          <email>miqbal@Overstock.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adair Kovac</string-name>
          <email>akovac@Overstock.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kamelia Aryafar</string-name>
          <email>karyafar@Overstock.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Overstock</institution>
          ,
          <addr-line>Midvale, Utah</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
<p>E-commerce platforms surface interesting products largely through product recommendations that capture users' styles and aesthetic preferences. Curating recommendations as a complete complementary set, or assortment, is critical for a successful e-commerce experience, especially for product categories such as furniture, where items are selected together with the overall theme, style or ambiance of a space in mind. In this paper, we propose two visually-aware recommender systems that can automatically curate an assortment of living room furniture around a couple of pre-selected seed pieces for the room. The first system aims to maximize the visual-based style compatibility of the entire selection by making use of transfer learning and topic modeling. The second system extends the first by incorporating text data and applying polylingual topic modeling to infer style over both modalities. We review the production pipeline for surfacing these visually-aware recommender systems and compare them through offline validations and large-scale online A/B tests on Overstock. Our experimental results show that complementary style is best discovered over product sets when both visual and textual data are incorporated.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Recommender systems; Collaborative filtering; Presentation of retrieval results; • Computing methodologies → Neural networks;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Overstock 1 is an e-commerce platform with the goal of creating
dream homes for all. Users browse Overstock’s catalog to select
pieces that complement one another while matching the stylistic
settings and color palettes of their rooms. As furniture is not a
disposable product, furniture purchases are subject to careful scrutiny
of aesthetics and strict budgeting. Brick and mortar stores often
inspire consumers by creating furniture showrooms, in some cases
pushing consumers to walk through their carefully selected furniture
displays. These assorted showrooms alleviate the creative and
stylistic pressure on the consumer, since the set is already on display with
all the necessary pieces.</p>
      <p>By crafting an appropriate recommender system, this experience
can be recreated on an e-commerce platform. Figure 1 illustrates
an example of the proposed system’s automatically generated
showroom, built around two seed products selected from Overstock.
The goal of this recommender system is to provide set
recommendations that adhere to a general theme or cohesive visual style while
accounting for essential item constraints, i.e. specific products which
must be included in an assortment, such as a bed frame in a bedroom
set or a vendor must-have in a subscription box. Set recommendations
have become more prominent with the rise of subscription box
services in various domains such as fashion (e.g. StitchFix,
www.stitchfix.com), jewelry (e.g. Rocksbox, www.rocksbox.com) and beauty
products (e.g. Birchbox, www.birchbox.com). The criteria
for selecting an assortment depend heavily on the product space,
which can inform the latent space of user preferences and product
representations. Personal care product recommendations, for
example, rely on user preferences, such as skin type, which can be
inferred from product text-based attributes. Fashion, jewelry and
furniture shopping, on the other hand, are predominantly visual
experiences. This motivates a visually-aware representation for products
and style-based preferences for users.</p>
      <p>We propose two assortment recommender systems. The first
takes advantage of visually semantic features transferred from a
deep convolutional neural network to learn style. The second uses
both these visual features and product text-based attributes to learn
style across the multimodal dataset. Our hypothesis is that while
visual style alone can help find similar products, using text data in
conjunction with images will result in more complementary and
cohesively styled assortments.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        As the prevalence of recommendation systems grows, e-commerce
platforms are looking to increase their influence on customer
purchases by not just recommending one item at a time but several in a
bundle together [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This is true across industries, from
travel agencies to clothing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Specifically, Zhu et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
model the problem of bundle recommendations as the Quadratic
Knapsack Problem (QKP), and find an approximation to it. They
use only implicit feedback data, rather than content data to build
their system, allowing for the bundles to be personalized. Using this
approach, we also view our system as a QKP, but look to leverage
primarily content data, incorporating implicit feedback as a minor
signal. To capture a user’s preference, the system we propose allows
a user to present a seed product, around which we can create the
bundle. This allows us to circumvent the cold start problem, as
any user, regardless of having a history on our site, can still build
assortment recommendations. New products are also seamlessly
incorporated, as long as image or text data are made available.
      </p>
      <p>
        Recommender systems are a well-studied field with two
overarching types: collaborative filtering methods [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and content based methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], many of which rely
on learning latent representations of the input data. Topic modeling,
specifically Latent Dirichlet Allocation (LDA)[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], has also been
applied to create topic-based recommendations from both text input
and implicit feedback [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. These systems
work by scoring individual products against one another or against
users to see which are most similar. Here we look to extend this
methodology by applying topic modeling over both image and text
data. Rather than simply concatenating image and text features and
linearly combining them, PolyLDA enables us to learn two distinct
but coupled latent style representations. This allows for a versatile
interpretation of style [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We then use these learned styles to make
bundle or assortment recommendations rather than the traditional
single item recommendations.
      </p>
      <p>
        To facilitate style discovery from the images, we process them
via prevalent deep learning techniques. Deep residual networks have
recently been shown to be a powerful model for capturing style-based
preferences for creating visual wardrobes from fashion images [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
With the goal of learning visually coherent styles, we apply
transfer learning, using a Resnet-50 (https://github.com/facebook/fb.resnet.torch) pre-trained on
ImageNet (http://www.image-net.org/) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We explore interpreting the convolutional neural
network by indexing the activations on channels within the
convolutional layers [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. We use a neural network to learn filters for our
data and simply index their responses to images to create visual
documents, similar to older bag-of-words methods [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. To discover
visually-aware trends, we use these documents with Latent Dirichlet
Allocation (LDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        While we use LDA for topic modeling on single-modality image
data, we need an extension to interpret both visual semantic features
and text-based attribute data as complementary modalities. Roller
et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] create a multimodal topic model that assumes words and
corresponding visual features occur in pairs and should be captured
as tuples within the topic distribution. This model would work well
for descriptions of images, but the underlying assumption could
prove to be too stringent to generalize to our application. Mimno et
al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] offer a more flexible extension of LDA, Polylingual LDA
(PolyLDA), which handles tuples of documents in different
languages with assumed identical topic distributions. The documents
within a tuple do not need to be direct translations of one another,
and the topics themselves have distinct sets of words for each
language. By treating our image data and our text data as two separate
languages, we can use PolyLDA directly to create a more flexible
multimodal topic model with which to infer style.
      </p>
      <p>
        Hsiao and Grauman [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] also apply PolyLDA to infer style from
images, but they use complementary clothing types (e.g. pants
and blouse) as languages in a compatibility model instead of using
PolyLDA to handle different modalities of data for the same item.
The documents they use for their topic models are generated using a
Resnet trained to automatically detect their pre-defined text attributes
within images and apply the appropriate labels.
      </p>
      <p>Much of the work performed on traditional recommendation
systems can be applied here, but additional constraints must be taken
into consideration. Products cannot simply be visually similar to be
purchased together; they must be complementary as well. For
example, having two similar brightly colored upholstered chairs as a result
in a traditional recommendation is perfectly acceptable, but if they
are both taken into one assortment with the intention of being bought
together, they may clash. Here we show that by incorporating text
data, we are able to avoid assortments of highly visually similar,
interchangeable products which are not at all complementary. These
nuances can be further addressed by incorporating purchase data to
infer what types of products are not just similar, but also compatible,
and likely to be bought together. To create the final assortment
recommendations, we define a distance measure that combines stylistic
similarity with implicit feedback from users, in the form of
purchases, to develop a concept of affinity similar to approaches in
composite recommendation systems such as [<xref ref-type="bibr" rid="ref10 ref23 ref27 ref9">9, 10, 23, 27</xref>].</p>
      <p>The rest of this paper is organized as follows: In Section 3, we
present our production pipeline for assortment recommendations as
deployed on Overstock and explore the product embeddings used
with our topic modeling techniques. We also review our formulation of
assortment generation as a 0-1 quadratic knapsack problem and our
greedy optimization for building assortments. Section 4 presents
our offline and online experimental results. Finally, we conclude the
paper in Section 5 and propose future directions.</p>
    </sec>
    <sec id="sec-4">
      <title>METHODOLOGY</title>
      <p>Here we provide an overview of the production system used to
surface recommendations throughout Overstock and describe our
assortment recommender system in detail. We first explain our
product embeddings as text-based bag-of-words and bag-of-visual-words
(BoVW) documents. Then we explain the LDA approaches
used to discover text-based and visual styles across our platform.
We finally discuss our formulation of assortment generation as a 0-1
quadratic knapsack problem with budget constraints and our greedy
assortment recommendation algorithm.</p>
    </sec>
    <sec id="sec-5">
      <title>System Overview</title>
      <p>Product recommendations on Overstock are surfaced through
multiple carousels across the platform. User interactions with products
and recommendations are recorded alongside purchase data. Each
product on the platform is represented by a corresponding title, text
attribute tags, text-based descriptions and images. The first step
in generating recommendations is processing the large volume of
product and user data, which are hosted on Hadoop servers.</p>
      <p>After initial processing on Hadoop, the data is transferred to a
single CUDA-enabled GPU server. Here product images are passed
through a publicly available Resnet-50 to create image-based
documents. Both the image and text documents are then fed into
Mallet (http://mallet.cs.umass.edu/) to generate topic representations
of products. We create two topic variants: a visual variant, which
uses the image data with LDA, and a multimodal variant, which uses
both text and image data with PolyLDA.</p>
      <p>Once the topic distributions are created, the products reside within
their defined feature space. Hereafter, the methodology for both
variants is identical. We construct a distance measure over the topic
space, using implicit feedback from our users, to model
compatibility. The assortments are then generated by finding products which
minimize this distance to predefined seed product pairs.</p>
      <p>The generated assortments are pushed to our Cassandra database
cluster. Our website servers can query Cassandra directly to see
which products have an associated assortment to present to users.
Assortment recommendations are featured on corresponding product
pages, below the product description, providing an option for users
to purchase an entire assortment, or just parts of it. A brief diagram
of the systems involved is presented in Figure 2.</p>
    </sec>
    <sec id="sec-6">
      <title>Product Embeddings</title>
      <p>Products on Overstock have several images of the item for sale and
different forms of associated text data, including title, description and
attribute tags. Product attributes are descriptive tags associated with
a product. Some example attribute categories are color, size, style,
composing materials, finish, and brand. In this paper, we utilize
these attributes as the primary text-based information since they
often provide a rich text representation of the item that complements
the provided images. All product attribute data is processed to remove
stop words and is concatenated with each product title to form a
bag-of-words document.</p>
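A minimal sketch of this text-document construction (the stop-word list and all names are illustrative, not the production implementation): the product title is concatenated with its attribute values, lowercased, and filtered against stop words.

```python
# Illustrative stop-word subset; the production system would use a full list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "with"}

def build_text_document(title, attributes):
    """Concatenate title and attribute values into a bag-of-words document."""
    tokens = title.lower().split()
    for value in attributes.values():          # attribute tags, e.g. color, material
        tokens.extend(value.lower().split())
    return [t for t in tokens if t not in STOP_WORDS]

doc = build_text_document(
    "Mid-Century Walnut Coffee Table",
    {"color": "brown", "material": "walnut wood", "style": "mid-century modern"},
)
```

The resulting token list is the text "document" consumed by the topic model.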
      <p>
        We use the methodology described in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to create visual
documents to be used with LDA via a Resnet. We then combine both
visual and text data together via PolyLDA.
      </p>
      <p>
        The convolutional layers within a Resnet are composed of a series
of small learned filters. Each filter is convolved with the input in
steps along its height and width. The result of this process is a
2-dimensional grid of activations that represents the response from
the filter at corresponding locations of the input. The learned filters
within a Resnet respond to specific patterns. By viewing these
filters as words, and indexing them, Iqbal et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] show that a
BoVW document can be created which, when combined with LDA,
is effective at uncovering style. To index a layer, we simply threshold
the response from the filter to indicate whether or not it is sufficiently
activated by the image to be included within the image’s document.
      </p>
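A minimal sketch of this indexing step, assuming a layer's activations are available as a channels × height × width array (the threshold, layer offset, and array shapes are illustrative): a channel contributes its visual word when its strongest response exceeds the threshold.

```python
import numpy as np

def visual_words(activations, layer_offset, threshold=1.0):
    """Return the set of visual-word ids whose filters fire on this image.

    activations: (channels, height, width) array for one layer and one image.
    layer_offset makes word ids unique across the several indexed layers.
    """
    channel_max = activations.max(axis=(1, 2))       # strongest response per filter
    active = np.nonzero(channel_max > threshold)[0]  # sufficiently activated channels
    return {layer_offset + int(c) for c in active}

rng = np.random.default_rng(0)
acts = rng.random((256, 14, 14)) * 2                 # fake layer activations
doc = visual_words(acts, layer_offset=512)
```

A product's visual document would then be the union of these sets over all of its images and over each of the indexed layers.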
      <p>Given this methodology for creating image documents, our visual
word vocabulary is defined entirely by the channels in the layers that
we choose to index. We tested creating documents using various
layers through the network. We found that combining several middle
layers gave us the best results. After experimenting we empirically
chose to use Resnet-50 layers 8, 18, 31, and 43. The combination
of these layers provided significantly better results, per visual
inspection, than any single layer or other combinations. This yielded a total
vocabulary size of 2816 visual words over our image documents.
Since multiple images can be provided for each product, we take the
union of all present visual words within the corresponding images as
the visual document for the product itself. The process is depicted
in Figure 4.</p>
    </sec>
    <sec id="sec-7">
      <title>Topic Modeling</title>
      <p>
        Once the visual documents are created, they can be used directly with
LDA to create topics for our visual variant. LDA is an unsupervised
generative model which infers probability distributions of words
within topics and models documents as a mixture of those topics
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Given a set of documents and a number of target topics, the
model assumes the following generative process was used to create
the documents.
      </p>
      <p>(1) For M documents, initialize the set of topic distributions,
θi ∼ Dir(α) for i = 1 … M.
(2) For K topics, initialize the set of word distributions, ϕk ∼ Dir(β)
for k = 1 … K.
(3) For the kth word in the ith document, select a topic zi from
θi and a word, wi,k, from ϕzi,
where Dir is the Dirichlet distribution.</p>
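The generative process above can be simulated directly; the sketch below uses small illustrative parameters, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, V, N = 4, 3, 20, 30        # documents, topics, vocabulary size, words per doc
alpha, beta = 0.5, 0.1           # Dirichlet hyperparameters (illustrative)

theta = rng.dirichlet([alpha] * K, size=M)   # step 1: topic mixture per document
phi = rng.dirichlet([beta] * V, size=K)      # step 2: word distribution per topic
docs = []
for i in range(M):
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta[i])            # step 3: pick a topic for the word
        words.append(int(rng.choice(V, p=phi[z])))  # then a word from that topic
    docs.append(words)
```

Inference (e.g. in Mallet) runs this process in reverse, recovering theta and phi from the observed documents.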
      <p>The model initializes the parameters randomly then iteratively
updates these parameters via Gibbs sampling or variational Bayes
inference. After many iterations the topic distributions converge to
a stable state and the resulting topics can be used as a low
dimensional feature space that captures the salient content of the original
documents.</p>
      <p>
        For the multimodal variant, each visual document is paired with
its text equivalent in a tuple. These tuples are then used with
Polylingual LDA (PolyLDA) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Products that are missing either a text
document or a visual document can still be used with PolyLDA, as
it is robust to missing documents within tuples. A diagram of the
system for multimodal document creation is provided in Figure 4.
PolyLDA is an extension of LDA meant to handle loosely equivalent
documents in different languages. It takes as input tuples of
documents, each in a different language but with the same exact topic
distribution. Each resulting latent topic from this method contains a
distinct associated set of word distributions for each language.
Documents aren’t required to be direct translations of one another, which
allows for flexibility. A word in one language could be attributed
to a given topic while its direct translation in another language may
be attributed to a different topic. Words that only appear in one
language can also be present in topics despite having no equivalent
translations in the corpus. PolyLDA extends LDA by assuming the
following generative procedure:
(1) For a given tuple of documents, {d1 … dL}, initialize a
single set of topic distributions, θ ∼ Dir(α).
(2) For K topic sets with L languages, initialize the set of word
distributions, ϕk,l ∼ Dir(β) for k = 1 … K and l = 1 … L.
(3) For the kth word in the lth document in the ith tuple, select
a topic zi,l from θi and a word, wi,l,k, from ϕzi,l.
      </p>
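A sketch of assembling PolyLDA input tuples from the two modalities (all names are hypothetical): each product contributes a tuple of one visual-word document and one text document, and a product missing a modality contributes an empty document in that language, which PolyLDA tolerates.

```python
def build_tuples(visual_docs, text_docs, product_ids):
    """visual_docs/text_docs map product id -> token list (entries may be absent)."""
    return [
        (visual_docs.get(pid, []),   # language 1: indexed channel activations
         text_docs.get(pid, []))     # language 2: title + attribute words
        for pid in product_ids
    ]

tuples = build_tuples(
    {"p1": ["vw_8_12", "vw_18_40"]},
    {"p1": ["walnut", "mid-century"], "p2": ["velvet", "tufted"]},
    ["p1", "p2"],
)
```

Here "p2" has no images, so its tuple carries an empty visual document while its text document still informs the shared topic distribution.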
      <p>For our multimodal variant, we make use of the flexibility
afforded by PolyLDA by applying it to our visual and text documents,
representing each modality as a distinct language attempting to
describe the same product. This method should allow for topics that
afford better generality. Intuitively for our corpus, we suspect that
a topic could capture relationships between certain visual features,
like thick wood, with textual attributes, like "quality", that don’t
have a direct visual representation but do share an underlying
contextual relationship. By combining modalities the model also affords
us the ability to infer a more complete style representation given
incomplete data. For example, we can apply our multimodal topics
to products with only text attributes available and still infer style that
would normally be captured by the missing image data.</p>
      <p>We provide visualizations of the resulting topics from both
variants in Figure 5. To create these visualizations we selected the
products that are maximally aligned with each topic. In our case,
we have empirically seen that the topics relate well to various styles
while reducing our feature space from a large number of attribute
tags, title words, and image-based channel activations to a succinct
k-dimensional space (where k is the number of topics). These styles
are not necessarily complimentary though, especially in the case of
the visual topics. We can see that the items within the topics seem
more substitutable, e.g. in the cases of brightly colored furniture,
similarly patterned upholstery, and mirrored surfaces.
3.4</p>
    </sec>
    <sec id="sec-8">
      <title>Knapsack Problem</title>
      <p>Once we have all our products residing within the topics’ feature
space, we can begin to build out our assortments. The process for
generating the assortments is the same for both the visual variant and
the multimodal variant, with the only difference between the two
being the topic representations of products. To build these
assortments, we first define the necessary components of an assortment as
verticals. The verticals we define here enable us to build living room
assortments, but we would like to emphasize that verticals can be
defined for any e-commerce platform. If desired, the verticals can
be left undefined, and the assortment can be built with no vertical
constraints. For our specific application verticals are critical pieces
of living room furniture, e.g. coffee table, chair, or accent table. Our
definitions are manually applied to the products and are listed below.
(Here a couch set can be either a sectional sofa or a sofa and loveseat
combination.)</p>
      <p>Verticals: {Couch Set, Coffee Table, Accent Table, Entertainment
Center, Bookshelf, Ottoman, Chair}</p>
      <p>We chose to build our assortments around seed pieces. For our
current application, we generated seeds, but we can easily allow
users to provide their own seeds around which we can curate an
assortment. We selected seeds as preferred pairings of the most crucial
verticals in any living room space, namely a couch set and a coffee
table. Although all the other verticals can be seen as optional, these
two verticals are defining members of a living room and therefore
must be present in any assortment. We chose pairs of sofas and
coffee tables which are most frequently co-clicked for our seeds.</p>
      <p>
        We then assume a budget constraint for the entire assortment. As
such we can formulate our assortment generation as a 0-1 quadratic
knapsack problem (0-1 QKP) as defined by Gallo et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The 0-1
QKP assumes that a knapsack with limited space needs to be filled
with a set of items. These items all have an intrinsic value. Items
also have an additional associated pair-wise profit when selected
together for the knapsack.
      </p>
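As a sketch, the 0-1 QKP objective for assortment selection can be written as follows, using the terminology defined later in this section (score qi, pairwise distance di,j, price ci, budget B), with xi the 0-1 selection variable for product i; the per-vertical minimum and maximum counts ℓv and uv are our notation:

```latex
\max_{x \in \{0,1\}^{n}} \quad
  \sum_{i=1}^{n} q_{i}\, x_{i}
  \;+\; \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{x_{i}\, x_{j}}{d_{i,j}}
\qquad \text{s.t.} \qquad
  \sum_{i=1}^{n} c_{i}\, x_{i} \le B,
\qquad
  \ell_{v} \;\le\; \sum_{i \,:\, a_{i} = v} x_{i} \;\le\; u_{v}
  \quad \text{for every vertical } v.
```

The pairwise profit of selecting two products together is the reciprocal of their Mahalanobis distance, matching the greedy scoring described later in this section.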
      <p>For a given seed, we can find the optimal assortments by trying to
find the products that are nearest the seed in the topic space. We can
add the additional constraint that we want all products within the
assortment to be nearest one another in the topic space as well. The
total assortment must remain within the assumed budget constraint.
Each product has an associated vertical, and we can add in some
constraints on the minimum and maximum number of products each
vertical can contribute to the final assortment.</p>
      <p>
        To calculate proximity we use a Mahalanobis distance built on
purchase data similar to [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The Mahalanobis distance allows us
to leverage the implicit feedback of our users to understand how our
topics distribute and relate to one another across multiple products
within a purchase.
      </p>
      <p>We assume that purchases of living room furniture made by the
same user over a 3 month period are being used in the same room,
and as such can be said to represent a compatible assortment. After
trimming to include only purchases with 3-10 pieces of living room
furniture, so as to avoid noise and erroneous signals from bulk
purchasers, we create a dataset of roughly 7.5k purchased assortments.
We take the L2 normalized sum of the topic distributions of the
copurchased products as a topic representation of the purchase itself,
and use this dataset to calculate the covariance matrix used for our
Mahalanobis distance.</p>
      <p>The Mahalanobis distance is defined as dM(xi, xj) = (xi − xj) M (xi − xj)^T,
where M is the covariance matrix for the topics learned from
our purchase data. For each seed pair of coffee table and couch set,
we greedily swap products in and out of our assortment until we
converge to an ideal set of products. The only verticals which are
not allowed to change are those in the seed.</p>
      <sec id="sec-8-1">
        <title>Algorithm 1 Generate Assortments</title>
        <p>To formulate the 0-1 QKP we first define some terminology. Each
product o has a vertical label ao, an associated price co, and an
associated topic distribution to. Every pair of products oi and oj has
an associated pairwise distance di,j, the Mahalanobis distance built on
the purchases and topic distributions. Additionally, each product has
a score qo representing its value to the seed; here we use the
inverse of the distance from the product to the seed. Let M be the set
of products comprising the assortment. Let B be the total available
budget minus the cost of the seed. The 0-1 knapsack problem is then
to select the assortment M that maximizes the total value, i.e. the sum
of the scores qo and the reciprocals of the pairwise distances di,j over
the selected products, subject to the total price remaining within B
and the vertical count constraints.</p>
        <p>A greedy approximation can now be formulated to follow the
constraints of the 0-1 QKP. First a solution is initialized; then, iteratively,
items are swapped until the system converges to an optimal solution.
To get the initial set, all products are scored by their total potential
for compatibility (the sum of the product’s score with the sum
of the reciprocals of its pairwise Mahalanobis distances) divided
by their price. Products are then sorted by this ratio and the highest
scored are added until vertical constraints or budget constraints are
met. Then each product is considered for a swap with other products
to improve the total compatibility of the assortment.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Greedy Method</title>
      <p>For our online evaluation, we relax the budget constraints, so that we
can display the assortment on product pages. This allows us to select
the best assortment per product. We can easily reapply the budget
constraint for user assortments. As such, we use the following greedy
method to generate the assortments. The assortment is initialized
as first only containing the seed items, all other verticals are empty.
Each vertical is then iteratively considered for new products while
holding all other verticals constant. New products are chosen from
the pool of products that are labeled as candidates for this vertical.
Those which are closest to the total assortment (minus the current
vertical) at the previous time-step are added in, until the size of the
vertical is met. The sizes of the verticals in our experiment were
set by us, but a user can dictate how many side tables, bookshelves,
chairs, etc. they wish to add to their assortment.</p>
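A minimal sketch of this greedy loop, with hypothetical names and a toy one-dimensional distance standing in for the Mahalanobis distance over topic space: seed verticals stay fixed, and each other vertical is refilled with the candidates closest to the rest of the previous timestep's assortment until nothing changes.

```python
def greedy_assortment(seed, candidates, sizes, distance, max_iters=20):
    """seed/candidates: vertical -> products; sizes: vertical -> desired count."""
    assort = {v: [] for v in sizes}
    assort.update({v: list(ps) for v, ps in seed.items()})  # seed verticals fixed
    for _ in range(max_iters):
        prev = {v: list(ps) for v, ps in assort.items()}
        for v in sizes:
            if v in seed:
                continue
            # Everything selected so far, excluding the vertical being refilled.
            others = [p for u, ps in prev.items() if u != v for p in ps]
            ranked = sorted(candidates[v], key=lambda p: distance(p, others))
            assort[v] = ranked[:sizes[v]]
        if assort == prev:
            break                        # converged: no vertical changed
    return assort

# Toy usage: products live on a line; distance = summed gaps to the rest.
pos = {"sofa": 0.0, "table": 0.1, "chairA": 0.2, "chairB": 5.0, "shelf": 0.3}
d = lambda p, others: sum(abs(pos[p] - pos[o]) for o in others)
out = greedy_assortment(
    seed={"couch": ["sofa"], "coffee": ["table"]},
    candidates={"chair": ["chairA", "chairB"], "bookshelf": ["shelf"]},
    sizes={"couch": 1, "coffee": 1, "chair": 1, "bookshelf": 1},
    distance=d,
)
```

The outlier "chairB" is never selected because it sits far from the rest of the assortment.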
      <p>Let V be the set of all verticals. S is the set of verticals used in the
seed. Ai is the set of products selected for the assortment in vertical i
for the current timestep. Each vertical i has an associated size, sizei,
which is the number of products we want for this vertical. Ai,prev
is the set of products selected for the assortment in vertical i for the
previous timestep. pi is an element of Ai,prev. oi is a candidate
product for vertical i. to is the topic distribution of a product. dM is
the Mahalanobis distance as defined above. δ is the change in the
distance of the assortment from iteration to iteration. The greedy
approach is described using this notation in Algorithm 1. Figure 6
illustrates the generated assortments using this method for both the
visual and multimodal variants.</p>
    </sec>
    <sec id="sec-10">
      <title>RESULTS AND DISCUSSION</title>
      <p>This section describes a large-scale experiment on Overstock to
determine whether simultaneously learning style from text and
images provides better results than learning style from only images.
All offline validations and the online A/B test assume 2 variants: a
visual-only assortment recommender system, which is built on top
of a BoVW representation of product images, and a multimodal one,
which is built on text-based attributes and visual words. Our site
does offer manually curated collections, but there is little overlap
between the products selected in the manual process and those selected
for our automated system. The methodology used to create these
collections also involves selecting products from the same vendor and
the same product line, which can often be viewed as substitutable
products as well. These would thus serve as a poor comparison
to our model, which attempts to find stylistically complementary
products of different categories regardless of vendor, rather than
similar or identical products. As such we do not use them as a baseline
against which to compare our model.</p>
      <p>Online Evaluation. We run an A/B test on Overstock with both
the visual-variant and multimodal-variant assortments on product
pages. Users are given the option to add the entire assortment,
or individual items from the recommendations module, to their cart.
User engagement is measured via click-through rate (CTR) on the
assortment recommendations, a classic measure of user engagement
with product recommendations on e-commerce platforms. Our
findings show that the multimodal variant outperforms the visual
variant, with a statistically significant CTR lift of 10.9% relative to
the visual-only baseline.</p>
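      <p>A relative CTR lift of this kind is typically checked with a two-proportion z-test on clicks and impressions; the sketch below illustrates the standard computation (the counts in the test are invented for the example, and the paper does not specify which test was used):</p>
      <preformat><![CDATA[
```python
from math import sqrt, erf

def relative_lift(clicks_a, imps_a, clicks_b, imps_b):
    """Relative CTR lift of variant B over baseline A."""
    ctr_a = clicks_a / imps_a
    ctr_b = clicks_b / imps_b
    return (ctr_b - ctr_a) / ctr_a

def two_proportion_z(clicks_a, imps_a, clicks_b, imps_b):
    """Two-sided two-proportion z-test on the CTR difference."""
    p_pool = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / imps_a + 1 / imps_b))
    z = (clicks_b / imps_b - clicks_a / imps_a) / se
    # Normal CDF via erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```
]]></preformat>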
      <p>Offline Evaluations. As we are recommending sets of products
together, we cannot use traditional offline evaluations such as AUC,
MAP, and Recall. Instead, we score each assortment with the average
click-based Jaccard coefficient calculated pairwise over all products
contained within the assortment, built from logs of product clicks
over the last two months. This tells us how compatible the various
products within an assortment are with one another.</p>
      <p>To calculate the assortment Jaccard, we look at S, the set of all
user browsing sessions that included clicks on at least two items of
living room furniture. A user browsing session spans only one visit,
so we can assume that when items are co-clicked within the same
session the user still has the same intent and is implicitly seeing
both products as satisfying that intent. We take all user browsing
sessions over a two-month span for our dataset. We denote the subset
of S containing sessions that clicked on product a as S<sub>a</sub>.</p>
      <p>We calculate the Jaccard coefficient for all pairs of products (a, b)
within our corpus as follows:</p>
      <p>J(a, b) = |S<sub>a</sub> ∩ S<sub>b</sub>| / |S<sub>a</sub> ∪ S<sub>b</sub>|</p>
      <p>In contrast to the standard Jaccard formulation, if |S<sub>a</sub> ∪ S<sub>b</sub>| = 0,
we consider J(a, b) = 0. This modification prevents products
that were never visited together from being counted as perfectly
compatible with each other, since the standard formulation defaults
these cases to 1.</p>
      <p>For an assortment A we calculate an assortment Jaccard score J<sub>A</sub>
as the simple average of the pairwise Jaccard coefficients over all items
within the assortment, excluding the Jaccard coefficient between
the two seeds. An example assortment with a good Jaccard score is
shown in Figure 1.</p>
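      <p>The click-based score above can be sketched as follows (a minimal illustration with invented session logs; the zero-denominator modification and the seed-pair exclusion follow the definitions in the text):</p>
      <preformat><![CDATA[
```python
from itertools import combinations

def jaccard(sessions_a, sessions_b):
    """Pairwise Jaccard with the zero-denominator modification."""
    union = sessions_a | sessions_b
    if not union:
        return 0.0  # never co-visited products are NOT treated as compatible
    return len(sessions_a & sessions_b) / len(union)

def assortment_jaccard(assortment, seeds, sessions_by_product):
    """Average pairwise Jaccard over an assortment, skipping the seed pair."""
    scores = []
    for a, b in combinations(assortment, 2):
        if a in seeds and b in seeds:
            continue  # exclude the coefficient between the two seeds
        scores.append(jaccard(sessions_by_product.get(a, set()),
                              sessions_by_product.get(b, set())))
    return sum(scores) / len(scores) if scores else 0.0
```
]]></preformat>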
      <p>The multimodal variant maintains a much higher average assortment
Jaccard than the visual-only variant. The dramatic difference between
the two may reflect an underlying bias in the click data: the text
modality looks at the same data used in site search and navigation
to surface groups of products to users, so co-click data may reflect
better discoverability of items with similar textual features as
opposed to similar visual features.</p>
      <p>We also examine the distribution of topics from each assortment.
The topic representation of products in our PolyLDA-generated
space is more diverse than that of the LDA. Products in the
LDA space are usually attributed strongly to only one or two topics,
while PolyLDA does a better job of representing a product as
a mixture of topics. This affects style because the products within
any single learned topic are very visually similar, lending themselves
to substitutability rather than complementary behavior.</p>
      <p>Figure 7 shows how many topics compose each assortment. The
distribution for the visual variant is heavily skewed, with the majority
of assortments composed of only a few topics. The multimodal
variant has much more variety: assortments composed of only
a few topics and assortments composed of nearly half the topics
are equally likely. This shows that the multimodal variant
offers more diverse assortments. This is preferred since, as depicted
in Figure 5, the topics themselves often contain substitutable rather than
complementary items. Properly blending the topics together to create
a cohesive look results in better assortments, as depicted in Figure 6.</p>
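      <p>The diagnostic behind this comparison can be sketched as counting how many distinct topics carry meaningful mass across an assortment's products; in this illustration the 0.05 cut-off is our own assumption, not a value from the paper:</p>
      <preformat><![CDATA[
```python
import numpy as np

def topics_in_assortment(topic_vectors, threshold=0.05):
    """Number of distinct topics with mass above threshold in any product."""
    stacked = np.vstack(topic_vectors)           # products x topics
    active = (stacked > threshold).any(axis=0)   # topic used by any product?
    return int(active.sum())
```
]]></preformat>
      <p>Applied to every generated assortment, the resulting counts give the per-variant distributions compared in Figure 7.</p>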
    </sec>
    <sec id="sec-11">
      <title>CONCLUSION</title>
      <p>Online shopping is a visual experience, and visually-aware
recommendations are crucial to e-commerce platforms. Yet systems
that rely solely on images suffer from a lack of diversity. By
incorporating both image and text data we are able to create
cohesive styles.</p>
      <p>In this paper we introduced a deep visually-aware large-scale
assortment recommender system for Overstock. Our assortment
recommender system takes advantage of product images to create
visually coherent trends from Overstock products. We introduced
two variants: a visual-only variant and a multimodal variant. The
visual-only variant creates a bag-of-visual-words representation of
product images by thresholding the activations from specific layers
of a pre-trained deep residual neural network, ResNet-50, and then
applies topic modeling (LDA) on these product image representations
to discover visual trends among Overstock products. We then proposed
a greedy approach (with and without budget constraints) to create
assortment recommendations around seed items that maximize
the visual compatibility of the set. The multimodal variant takes
advantage of text-based product attributes in addition to the image
representation, utilizing polylingual LDA (PolyLDA) to create trends
based on two modalities, images and text.</p>
      <p>We have featured multiple assortments generated from both
models and evaluated our results through a set of offline validations
and a large-scale online A/B test on Overstock. Our experimental
results indicate that incorporating both image and text data provides
a more cohesive visual style than using images alone and can
enhance user engagement with the recommendations module.
We also show that PolyLDA provides a meaningful way to
simultaneously learn style across text and image data.</p>
    </sec>
  </body>
  <back>
  </back>
</article>