<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prompt-Based Fashion Outfits Retrieval and Recommender System Using Binary Hashing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quocdung Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoangnam Pham</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duyhung Dao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Quangmanh Do</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vanha Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FPT University</institution>
          ,
          <addr-line>Hanoi 155514</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The exponential growth of e-commerce in recent years has transformed the fashion industry, propelling it into a new era of digital retail. With the convenience of online shopping, consumers now have access to an extensive array of fashion products from the comfort of their homes and, as a result, need more efficient and personalized shopping experiences. This demand has paved the way for advances in recommendation and retrieval systems for fashion e-commerce. In this paper, we build a system to streamline and enhance the retrieval of fashion outfits from vast and diverse collections. Our system consists of two components: a CLIP-like model that retrieves item images matching a textual description, and a network that uses hashing modules for efficient personalized fashion outfit recommendation. Through extensive experimentation and evaluation, we demonstrate the effectiveness of our system in providing accurate and personalized outfit recommendations that match descriptions supplied by consumers, such as a particular color, style, occasion, or season.</p>
      </abstract>
      <kwd-group>
        <kwd>Fashion Retrieval</kwd>
        <kwd>Outfit Recommendation</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>Hashing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The fashion industry is a dynamic landscape characterized by ever-changing trends, styles, and
personal preferences. The field has traditionally been driven by the instincts and intuitions of
designers, fashion houses, and trendsetters. However, the advent of machine learning has introduced a
new dimension, one where data-driven insights and algorithms wield significant influence. Machine
learning, a subfield of artificial intelligence, can extract intricate patterns from vast datasets,
making it well suited to decoding the complexities of fashion. The potential applications are manifold,
from predictive analytics that anticipate the next big trend to personalized shopping experiences that
cater to individual tastes.</p>
      <p>
        One of the most captivating developments in the fashion domain in recent times is the emergence of
fashion item retrieval systems, especially in the context of composite outfits. As the number of items
within each garment category increases, the potential combinations for outfits grow exponentially.
Given the typically vast size of fashion inventories, the sheer magnitude of possible outfits that can be
curated from these items becomes orders of magnitude greater. The task of mining fashion ensembles
from an extensive inventory poses significant challenges, underscoring the necessity for intelligent
fashion recommendation techniques [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Furthermore, using text prompts to suggest fashion apparel is relatively new in this field,
particularly for recommending multiple harmonious items simultaneously. Our objective in this work is
to address this challenge.
      </p>
      <sec id="sec-1-1">
        <title>1.1. Content-based Fashion Retrieval</title>
        <p>
          Content-based fashion image retrieval (CBFIR) methods retrieve the desired fashion items or products
from a query given as an image, text, or visual clue. Most work on this task uses reference images
or multiple modalities (i.e., image and text) to retrieve the desired fashion products for a user.
Rubio et al. leverage both images and textual metadata and propose a joint multi-modal embedding that
maps text and images into a common latent space, in which retrieval can be performed effectively [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Shin et al. propose a style feature extraction
(SFE) layer that decomposes the clothes vector into style and category [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. They append the layer to a
Siamese CNN and train with a loss function composed of softmax loss, contrastive loss, and center loss
to predict stylish matching clothes effectively. Recently, contrastive learning has emerged as a
prominent method for learning meaningful representations in machine learning. The approach is grounded
in the notion that semantically related concepts (for instance, two images of the same object captured
from different angles) should have similar representations, whereas unrelated concepts should have
distinct ones. CLIP is a multimodal neural network for vision and language, trained with contrastive
learning to associate visual concepts with text [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Specific to the fashion industry, Chia
et al. trained their CLIP model on a fashion dataset containing 800K products [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The model, called
FashionCLIP, is shown to learn general fashion concepts that transfer across tasks in the domain. We
leverage this model to retrieve fashion items from a textual description.
        </p>
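        <p>To make the contrastive objective concrete, the following is a minimal sketch of a symmetric CLIP-style loss over a batch of paired image and text embeddings. It is an illustration under stated assumptions, not the training code of CLIP or FashionCLIP: the function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are all chosen here for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target_index):
    """Negative log-softmax probability of the target entry (numerically stable)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss: the i-th image should match the i-th text."""
    images = [normalize(v) for v in image_embeddings]
    texts = [normalize(v) for v in text_embeddings]
    n = len(images)
    # Pairwise similarity logits: rows index images, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][i] for r in range(n)], i) for i in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Hypothetical toy batch: two images and two captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]  # caption i describes image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]  # captions are mismatched

aligned_loss = contrastive_loss(images, aligned_texts)
swapped_loss = contrastive_loss(images, swapped_texts)
```
]]></preformat>
        <p>In this toy batch, the aligned pairing yields a loss close to zero, while the mismatched pairing is heavily penalized, which is the behavior that pulls matching image and text representations together.</p>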
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Outfit Recommendation Using Hash Learning</title>
        <p>
          Recent years have seen growing interest in developing intelligent fashion recommendation systems to
help users discover and purchase clothing and accessories that match their personal style. The number
of possible outfits grows exponentially with the number of items in each garment category. Two ways
of evaluating the compatibility of outfit items have been proposed. One approach models pairwise
compatibilities between fashion items, e.g., with a Siamese network [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] or functional tensor factorization [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
The other seeks to model high-order relations among the items of an outfit, e.g., with a recurrent neural
network [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Hashing techniques that learn data-driven binary codes have become popular for enabling efficient
similarity search in large-scale multimedia retrieval tasks. The aim is to preserve the nearest-neighbor
relations of the original space in the Hamming space, i.e., to minimize the gap between the similarity
computed on hash codes and the similarity in the original space. Many methods learn a real-valued
embedding and then take the sign of each value to obtain binary codes.</p>
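        <p>The sign-based binarization described above can be sketched in a few lines. This is a toy illustration, not our trained hashing module: the four-dimensional embeddings are made-up values, and mapping zero to +1 is an arbitrary convention adopted here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def binarize(embedding):
    """Map a real-valued embedding to a {-1, +1} code with the sign function.

    Zero is mapped to +1 here; this tie-breaking rule is an assumption.
    """
    return [1 if value >= 0 else -1 for value in embedding]


def hamming_distance(code_a, code_b):
    """Count the bit positions in which two codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def inner_product(code_a, code_b):
    """Inner product of two {-1, +1} codes."""
    return sum(a * b for a, b in zip(code_a, code_b))


# Hypothetical real-valued embeddings of three items.
query = binarize([0.9, -0.2, 0.4, -0.7])
near = binarize([0.8, -0.1, 0.3, -0.9])
far = binarize([-0.5, 0.6, -0.3, 0.2])

distance_near = hamming_distance(query, near)
distance_far = hamming_distance(query, far)
```
]]></preformat>
        <p>For codes of length n with entries in {-1, +1}, the inner product equals n minus twice the Hamming distance, so ranking by either quantity gives the same nearest neighbors.</p>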
        <p>
          Because of the huge number of fashion items, efficiency is an extremely important concern in
a practical recommendation system. Learning to hash has been extensively studied for efficient image
retrieval. The hashing network used in our system models outfit compatibility through pairwise
interactions and employs the weighted hashing technique [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for matching users and items.
        </p>
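        <p>As a rough illustration of why weighting can help, plain Hamming distance between short codes takes only a few integer values, so many items tie. A weighted variant gives each bit its own weight and can break such ties. The sketch below uses hypothetical constant weights and toy three-bit codes; how the weights are actually obtained in the cited technique is not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def weighted_hamming(code_a, code_b, weights):
    """Sum the weights of the bit positions where two codes disagree."""
    return sum(w for a, b, w in zip(code_a, code_b, weights) if a != b)


query = [1, 1, -1]
item_x = [-1, 1, -1]  # disagrees with the query on bit 0 only
item_y = [1, 1, 1]    # disagrees with the query on bit 2 only

unit_weights = [1.0, 1.0, 1.0]  # reduces to plain Hamming distance
bit_weights = [0.2, 0.5, 1.0]   # hypothetical per-bit importance

plain_x = weighted_hamming(query, item_x, unit_weights)
plain_y = weighted_hamming(query, item_y, unit_weights)
weighted_x = weighted_hamming(query, item_x, bit_weights)
weighted_y = weighted_hamming(query, item_y, bit_weights)
```
]]></preformat>
        <p>With unit weights the two items are equally distant from the query, whereas the weighted distance ranks the first item closer because it disagrees only on a low-weight bit.</p>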
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. The Proposed Approach</title>
      <p>
        Fig. 1 illustrates the architecture of FashionCLIP, which operates in two
phases. In the first phase, the image encoder maps every garment image in the database into a
1024-dimensional vector space, and the resulting vectors are stored in the database. In the second
phase, when a user submits a query, the text encoder projects the query into a vector of the same
dimensionality as the image embeddings. The dot product of the prompt embedding with every image
embedding is then computed to identify the most compatible garment. FashionCLIP uses a transformer
with the architecture modifications described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] as the text encoder. The image encoder is a variant of the Vision
Transformer (ViT) model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
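      <p>The second phase amounts to a dot-product ranking over precomputed image embeddings. The sketch below shows that step only, under stated assumptions: the embeddings are made-up three-dimensional vectors rather than 1024-dimensional encoder outputs, and the item identifiers and the function name are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_embedding, image_index, k=3):
    """Return the ids of the k images whose embeddings score highest."""
    scored = [
        (dot(query_embedding, embedding), item_id)
        for item_id, embedding in image_index.items()
    ]
    scored.sort(reverse=True)  # highest score first
    return [item_id for _, item_id in scored[:k]]


# Hypothetical precomputed image embeddings (phase one).
image_index = {
    "red_dress": [0.9, 0.1, 0.0],
    "blue_jeans": [0.1, 0.9, 0.1],
    "red_scarf": [0.7, 0.0, 0.3],
}

# Hypothetical text embedding of a prompt such as "something red" (phase two).
prompt_embedding = [1.0, 0.0, 0.1]

best_matches = top_k(prompt_embedding, image_index, k=2)
```
]]></preformat>
      <p>Because the image embeddings are computed once and stored, only the text encoder and the dot products run at query time.</p>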
      <p>The architecture of the hashing fashion model is shown in Fig. 2. It includes three components:
a feature network that extracts features, multiple type-dependent hashing modules that learn binary
codes, and a matching block that predicts preference scores. Each user is represented by a one-hot
vector indicating their index. Convolutional networks extract image features, and textual information
can optionally be used. Items from different categories, as well as users, are treated as different types. The
hashing modules contain fully connected layers followed by a sign function for binarization. The matching
block computes the preference score from the binary codes. The final score consists of two terms: one
captures item compatibility and the other incorporates the user’s taste.</p>
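      <p>One way to picture the two-term score is the sketch below, which averages pairwise code similarities between items for the compatibility term and user-to-item code similarities for the taste term. The exact form used by the matching block is not specified here, so the averaging, the equal weighting (the hypothetical parameter user_weight), and the toy four-bit codes are all assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations


def code_similarity(code_a, code_b):
    """Normalized inner product of two {-1, +1} codes, in the range [-1, 1]."""
    return sum(a * b for a, b in zip(code_a, code_b)) / len(code_a)


def preference_score(user_code, item_codes, user_weight=0.5):
    """Combine an item-compatibility term with a user-taste term."""
    if len(item_codes) < 2:
        raise ValueError("an outfit needs at least two items")
    pairs = list(combinations(item_codes, 2))
    # Term 1: how well the items go with each other.
    compatibility = sum(code_similarity(a, b) for a, b in pairs) / len(pairs)
    # Term 2: how well each item matches the user's code.
    taste = sum(code_similarity(user_code, c) for c in item_codes) / len(item_codes)
    return (1 - user_weight) * compatibility + user_weight * taste


# Hypothetical binary codes for one user and two candidate outfits.
user = [1, 1, -1, -1]
matching_outfit = [[1, 1, -1, -1], [1, 1, -1, 1]]
clashing_outfit = [[-1, -1, 1, 1], [1, -1, 1, -1]]

matching_score = preference_score(user, matching_outfit)
clashing_score = preference_score(user, clashing_outfit)
```
]]></preformat>
      <p>Since both terms reduce to inner products of binary codes, the score can be computed with cheap bit-level operations, which is the efficiency argument for hashing.</p>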
      <p>Our pipeline works as follows. Given a textual prompt from a user, we use FashionCLIP
to retrieve the fashion products that best match the prompt. The retrieved images are
treated as a compact database, and each primary apparel item in it is used as a query for the
hashing fashion model. For each query, the hashing model retrieves
supplementary items from other categories, such as bottoms, bags, outerwear, and shoes, with the aim
of composing a cohesive ensemble.</p>
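      <p>The second stage of this pipeline can be sketched as follows. The example assumes that the text-retrieval stage has already returned a list of tops with their binary codes, and it completes each outfit greedily by picking, per category, the item whose code agrees most with the seed. This is a simplification for illustration: it ignores compatibility among the supplementary items and the user term, and all identifiers, codes, and function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def agreement(code_a, code_b):
    """Number of bit positions on which two binary codes agree."""
    return sum(1 for a, b in zip(code_a, code_b) if a == b)


def complete_outfit(seed_id, seed_code, catalog_by_category):
    """Greedily add, per category, the item whose code best agrees with the seed."""
    outfit = {"top": seed_id}
    for category, items in catalog_by_category.items():
        outfit[category] = max(
            items, key=lambda item_id: agreement(seed_code, items[item_id])
        )
    return outfit


def recommend(retrieved_tops, catalog_by_category, max_outfits=3):
    """Build one outfit per retrieved top, keeping at most max_outfits results."""
    return [
        complete_outfit(item_id, code, catalog_by_category)
        for item_id, code in retrieved_tops[:max_outfits]
    ]


# Assumed output of the text-retrieval stage: (item id, binary code) pairs.
retrieved_tops = [("white_shirt", [1, 1, -1, -1])]

# Hypothetical catalog of supplementary items with their binary codes.
catalog_by_category = {
    "bottom": {"navy_chinos": [1, 1, -1, 1], "neon_shorts": [-1, -1, 1, 1]},
    "shoe": {"loafers": [1, 1, -1, -1], "flip_flops": [-1, 1, 1, 1]},
}

outfits = recommend(retrieved_tops, catalog_by_category)
```
]]></preformat>
      <p>The default of three outfits mirrors the maximum number of recommendations shown per prompt in our demonstration.</p>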
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>The Polyvore dataset provides a large-scale corpus for research on fashion outfit composition. It contains
over 1 million user-created outfits compiled from Polyvore, a popular fashion community website.
Each outfit includes fashion items of different categories such as tops, bottoms, and shoes that are put
together by Polyvore users. The dataset includes rich item metadata, e.g., product images, descriptions,
brands, categories, and user engagement statistics. Since its release, Polyvore has facilitated research
on outfit compatibility learning and fashion recommendation systems.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Demonstration</title>
        <p>We deploy our hashing network with the FastAPI library and serve the model through a web
application built with the Streamlit library. The model retrieves images
from a PostgreSQL database comprising approximately 500 randomly selected images from the
Polyvore dataset. We assess the model’s performance on a modest computing platform equipped with
an NVIDIA GeForce GTX 1050 Ti GPU and 8 GB of RAM.</p>
        <p>Fig. 3 shows typical query-based retrieval examples, in which a user enters
a query into the search interface and the system retrieves ensembles from the database
that are compatible with the query. Each row of the output shows an ensemble comprising
five garment categories: top, bottom, bag, outerwear, and shoe. In these examples, the
application presents at most three outfit recommendations per prompt. Notably, the
hashing network performs better when generating outfits for
textual descriptions of female outfits.</p>
        <p>Conversely, when generating recommendations for male outfits, the hashing network tends to
erroneously recommend female garments, particularly in the bottom and shoe categories, as exemplified
in the final query. This discrepancy arises from bias in the training dataset, in which the bottom
and shoe categories consist predominantly of female-oriented products.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this work, we study how to apply the CLIP model to retrieve images based on user prompts and
how to use hashing for efficient personalized fashion outfit recommendation.
Although there are numerous ways to represent outfit compatibility, the representation must be
chosen carefully to fit into hashing optimization. The system performs well in practice; however, it is not an
end-to-end solution. Future work could improve accuracy or make the optimization more efficient.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Learning binary code for personalized fashion recommendation</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>10562</fpage>
          -
          <lpage>10570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rubio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Simo-Serra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Moreno-Noguer</surname>
          </string-name>
          ,
          <article-title>Multi-modal joint embedding for fashion product retrieval</article-title>
          , in: ICIP, IEEE,
          <year>2017</year>
          , pp.
          <fpage>400</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-J.</given-names>
            <surname>Yeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>Sagong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Deep fashion recommendation system with style feature decomposition</article-title>
          , in: ICCE-Berlin,
          <year>2019</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Chia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <article-title>Contrastive language and vision learning of general fashion concepts</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>18958</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Veit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kovacs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bala</surname>
          </string-name>
          ,
          <article-title>Learning visual clothing style with heterogeneous dyadic co-occurrences</article-title>
          , in: ICCV,
          <year>2015</year>
          , pp.
          <fpage>4642</fpage>
          -
          <lpage>4650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Collaborative fashion recommendation: A functional tensor factorization approach</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM international conference on Multimedia</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Vasileva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Plummer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dusad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajpal</surname>
          </string-name>
          ,
          <article-title>Learning type-aware embeddings for fashion compatibility</article-title>
          ,
          <source>in: ECCV</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>390</fpage>
          -
          <lpage>405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Binary code ranking with weighted hamming distance</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1586</fpage>
          -
          <lpage>1593</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>