<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>SERGI: Similar Entity Retrieval using Grouped Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akshit Sarpal</string-name>
          <email>Akshit.Sharpal@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raviteja Uppalapati</string-name>
          <email>Raviteja.Uppalapati@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sayan Biswas</string-name>
          <email>sayan.biswas@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajesh N. Reddy</string-name>
          <email>Rajesh.Narasimha_Red@walmart.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Walmart Global Tech</institution>
          ,
          <addr-line>860 W California Ave, Sunnyvale, CA 94086</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>18</volume>
      <issue>2024</issue>
      <abstract>
        <p>Image data is frequently organized in semantic groups for entities such as e-commerce products, social media users or hotels on travel websites. There is a wide range of applications for retrieving entities based on their images, yet this area remains largely unexplored. Signals from images are commonly infused with other attributes in the form of embeddings, but purely leveraging groups of images for retrieval is relatively unexplored. Drawing inspiration from natural language literature, we developed an efficient and scalable method, SERGI (Similar Entity Retrieval using Grouped Images), for retrieving entities similar to given image groups. For practical implementation, we apply SERGI to an e-commerce use-case, aiming to identify products with brand misrepresentation. Despite the scarcity of benchmark methods for comparison, our system demonstrates superior performance compared to a baseline and a commonly used representation-based method, showing high precision in this relatively uncharted domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Image retrieval</kwd>
        <kwd>CLIP</kwd>
        <kwd>grouped images</kwd>
        <kwd>late-interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Image retrieval, a process which involves searching and retrieving images from an extensive
digital database, is an integral component of various digital solutions. Traditional methods
typically rely on metadata, such as captions, keywords, or descriptions, to facilitate text-based
searchability. An alternative approach is content-based image retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (CBIR), which
preferentially employs the visual content of an image over its metadata for search and retrieval.
Instance-based image retrieval (IIR) is a more specific approach, which seeks to identify images
from a database that depict the same object or scene as a reference image.
      </p>
      <p>Despite the extensive research conducted on instance-based image retrieval within the
computer vision community, retrieval of entities represented by collections of images remains a
largely uncharted territory. In many real-world scenarios, entities are frequently represented
by groups of images, such as product listings on an e-commerce website, hotel rooms or
destinations on a travel portal, or user-visited locations on social media platforms. There
are numerous practical applications that necessitate the identification of similar products,
hotels, or places based on a given query. Existing methodologies attempt to amalgamate
signals from an entity’s images and metadata into embeddings, utilizing strategies such as
summing, averaging, or concatenating the feature embeddings (where a feature could be an
individual image). However, these techniques encounter an array of challenges like variable
or increased dimensionality resulting from concatenation, and potential loss of information,
semantic meaning, and sensitivity to noise when summing or averaging embeddings.</p>
      <p>In this study, we delve into the complex problem of entity retrieval based on groups of
images. Here, ’entity’ is a broad term used to denote a set of either query or reference images.
Given the nascent state of research in the area of grouped image retrieval, our approach draws
inspiration from methodologies employed in natural language processing for information
retrieval. Consequently, we have developed a method we call SERGI (Similar Entity Retrieval
using Grouped Images), and have applied it to a brand categorization use-case within the context
of Walmart’s e-commerce marketplace. We adapt the retrieval system to perform fine-grained
visual categorization using signals across multiple images. The observations and insights from
this application are subsequently discussed.</p>
      <p>In an e-commerce marketplace setting, a product listing is composed of numerous
characteristics. However, maintaining the accuracy of these attributes can be challenging when listings
are managed by individual sellers. In this study, we focus specifically on brand name as an
attribute. We observe that a notable percentage of listings can contain misrepresented brand
names, which can lead to poor customer experiences and adversely impact sales. Consequently,
we utilize our retrieval system to identify products with erroneous brand names and recommend
appropriate corrections. The fundamental assumption of this work is that text attributes such
as brand names are more prone to misrepresentation, whereas manipulating images is more
complex. Image manipulation not only demands more time but can also adversely affect product
sales since customers heavily rely on images when deciding to purchase. Therefore, images can
be used as an anchor to validate text attributes.</p>
      <p>We have constructed an instance of our method, SERGI, to ascertain whether a new product
listing has been misbranded, utilizing groups of images for this purpose. Brands are categorized
into two distinct segments in this study. The first segment encompasses brands that boast an
established reputation and widespread recognition, referred to herein as ’trusted brands’. The
second segment comprises lesser-known brands, designated as ’unverified brands’. For the
purpose of this study, we have constructed a scalable index that stores unique representations
of all item images associated with these trusted brands. In addition, we have established a
real-time image retrieval system that persistently monitors all images derived from items linked
to unverified brands. This system is engineered to identify entities (groups of product images)
that demonstrate a substantial degree of resemblance with the indexed images of trusted brands.
Any matches of this nature are flagged in real-time, and the corresponding trusted brand
name is suggested as the correct brand name. This enhances the efficiency and accuracy of
our brand verification process, thereby providing a robust solution to the challenge of brand
misrepresentation. We call this use-case Brand Protection.</p>
      <p>
        Our methodology employs a late-interaction architecture [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to compare images from
unverified brand items to those of trusted brand items. The unique contributions of this study,
compared with previously published work, are as follows:
• We propose a generalizable approach to conduct grouped matches. Our research
diverges from existing image retrieval methods, which typically define an entity as a
single image. Instead, we center our work on image groups. This shift in focus presents
its own set of unique challenges, which we will explore in-depth in this paper.
• We have established a highly scalable content-based image retrieval system for
e-commerce marketplaces. Image retrieval has proven successful in various applications
such as recommender systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], healthcare [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], remote-sensing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and search
engines [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our work navigates the unique challenges associated with e-commerce
catalog product images, like the presence of swatches (an image with a uniform pattern,
such as the color of a dress), and generic images like placeholder images and nutrition
labels which can be highly similar across unrelated products and lead to false matches.
An instance of our proposed method, SERGI, was set up on e-commerce data, and the
findings from this implementation are shared within this study.
• We show a unique application of grouped image retrieval to detect products
with misrepresented brands by adapting our retrieval system for fine-grained
classification task. Identifying products that are broadly similar to a target item, such
as a blender, shoe, or toothpaste, is relatively straightforward. However, pinpointing
products that belong to the same brand can be considerably more demanding. Products
within the same brand often possess highly similar local features, as can be seen in
multiple models of blenders from a leading kitchenware brand. Instead of striving for
improvements in image-to-image matches using local features, our methodology employs
groups of images for fine-grained visual categorization of products to a set of trusted
brands.
      </p>
      <p>We study the performance of SERGI using multiple sets of real e-commerce data from
Walmart’s marketplace and share the details of our deployment for the classification
use-case. Our results underscore the effectiveness and scalability of our approach in detecting
misrepresented brands. Beyond enhancing customer trust in e-commerce marketplaces, our
solution offers a generalizable approach for other use-cases that require similarities between
groups of images.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        The domain of Content-Based Image Retrieval (CBIR), an extensively researched area focusing
on image matching, facilitates the retrieval of visually similar images from a specified database
with respect to a user-provided query image [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This process, in essence, involves a user
submitting a query image, following which, the system retrieves images from the database that
bear a visual resemblance to the query image. Images can be represented through a variety of
visual features such as color, texture, and gradient, among others. However, deep-learning-based
representations have proven to be superior to traditional feature descriptors [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        The development of dense image representations benefits from a plethora of pretrained models.
The use of Convolutional Neural Network (CNN) based architectures, including VGG [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
ResNet [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Inception [
        <xref ref-type="bibr" rid="ref12">12</xref>
], and EfficientNet [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], has become widespread. Recently,
transformer-based architectures have demonstrated superior performance, surpassing previous
state-of-the-art models. Among these, CLIP (Contrastive Language-Image Pre-training) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a
transformer-based machine learning model, has been instrumental in bridging the chasm between vision
and language. CLIP is designed to understand and generate representations of images and their
accompanying textual descriptions within a joint embedding space. During training, the model
utilizes contrastive learning to associate images and texts, thereby maximizing the similarity
between accurate pairs while minimizing the similarity between incorrect pairs. One of CLIP’s
defining features is its zero-shot capabilities, which underscores its ability to understand a broad
spectrum of vision and natural language tasks without the need for finetuning. Consequently,
CLIP embeddings are employed in this study to represent images.
      </p>
      <p>
        For computing grouped similarity between the query product and the indexed products, we
draw inspiration from natural-language-based neural retrieval. In natural language
applications, there are broadly three types of matching paradigms for neural information retrieval: (a)
representation-focused rankers that independently calculate query and document
representations and calculate vector similarity [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; (b) interaction-based rankers that model relationships
across query and document tokens and match them using a neural network [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], or rankers
that model interactions across and within tokens [17]; and (c) rankers based on late-interaction
that delay the connection between query and document terms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We adapt the highly
efficient late-interaction architecture for comparing groups of images, which enables us to use
precomputed candidate representations for image retrieval.
      </p>
      <p>Previous research and applications in e-commerce are either focused on single image retrieval,
such as [18] where only the primary product image is used; or retrieval using multimodal
representations [19][20]. We explore retrieval in grouped image settings.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>In this section, we cover our SERGI method in detail. In the context of image retrieval, a query is
a user’s image-based request to locate particular images from a database. In our generalized
definition, a query is an entity represented by a group of images. Retrieval refers to the process of
searching for and obtaining entities that are coarsely similar to the query. We call these candidate
entities, which represent a small and relevant subset from millions of entities. Reranking refers to
the process of reordering the initially retrieved candidates based on their fine-grained relevance
to the query. We first provide an overview of the image representations used, followed by
image retrieval and the novel reranking approach. We conclude this section by summarizing
the system design for the Brand Protection use-case. Table 1 summarizes common notation
used in this section.</p>
      <sec id="sec-4-1">
        <title>3.1. Image Representation</title>
        <p>
          In traditional content-based image retrieval, a query image is provided to the system and is
represented in the form of a feature descriptor. In neural retrievers, the representation is computed
as deep-learning based embeddings. We use Contrastive Language-Image Pre-training (CLIP)
[
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which is a simplified version of the unsupervised strategy proposed in ConVIRT [21]. The
model consists of two parts: a vision transformer and a text transformer which are trained to
perform a contrastive prediction task. The model is trained to maximize the cosine similarity
between an image and its corresponding text in the same minibatch while minimizing the cosine
similarity with other texts or images in the minibatch.
        </p>
        <p>Given its zero-shot abilities and the fact that it can be used to generate image embeddings
that carry visual semantics, we use a pretrained CLIP checkpoint (CLIP ViT-B/32) to generate
product image representations. Image representations/embeddings are normalized vectors in a
512-dimensional space. We use q to describe the query image representation.
q(i) = CLIP(img(i)) ∀ i ∈ {1, … , nq} (1)</p>
        <p>where nq represents the number of images belonging to the query product Q, and CLIP(·) denotes the image encoder.</p>
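As an illustrative sketch of this representation step (the `embed_image` stand-in below is hypothetical; in practice the pretrained CLIP ViT-B/32 image encoder produces the 512-d vectors):

```python
import numpy as np

D = 512  # CLIP ViT-B/32 embedding dimension

def embed_image(img: np.ndarray) -> np.ndarray:
    """Stand-in for the CLIP image encoder: any function mapping an
    image to a D-dimensional vector works here. A fixed random
    projection of the flattened pixels is used purely to illustrate
    the interface."""
    rng = np.random.default_rng(0)  # fixed seed -> deterministic projection
    proj = rng.standard_normal((img.size, D))
    return img.ravel() @ proj

def query_representation(images) -> np.ndarray:
    """q(i) = normalize(embed(img(i))) for each of the nq query images."""
    q = np.stack([embed_image(im) for im in images])
    return q / np.linalg.norm(q, axis=1, keepdims=True)

# toy query product Q with nq = 3 images of size 8x8x3
Q = query_representation([np.random.rand(8, 8, 3) for _ in range(3)])
```

Each row of `Q` is a unit-norm vector, matching the normalized 512-dimensional representations described above.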
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Candidate Generation</title>
        <p>The representation is used to retrieve the most similar images from an index (typically a vector
database in modern applications) that uses approximate nearest neighbor search to return
candidate images. Given a query product Q with nq images, similar images corresponding to
each query image are retrieved. This set of ns similar images across all query images is denoted
by {s}. Using a metadata database, each s(ij) is mapped to its product identifiers, and this set of
candidate products is denoted as {C}. Figure 1 displays the candidate generation process. We
summarize the process symbolically below.</p>
        <p>s(ij) = retrieve(q(i), I, N) ∀ i ∈ {1, … , nq}; j ∈ {1, … , nr} (2)</p>
        <p>{C} ← map(s(ij)) ∀ i ∈ {1, … , ns}; j ∈ {1, … , nr} (3)
where I denotes the product index, retrieve and map denote the nearest-neighbor search and the
image-to-product mapping, and nq represents the number of images belonging to the
query product Q.</p>
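The candidate generation step can be sketched as follows (the helper name and toy index are hypothetical; brute-force search stands in for the vector database's approximate nearest-neighbor search):

```python
import numpy as np

def generate_candidates(q, index_vecs, image_to_product, N):
    """For each query image q(i), retrieve the N nearest indexed images
    (brute-force Euclidean here; an ANN-backed vector database in
    production), then map each image hit to its product ID to form the
    candidate set {C}."""
    candidates = set()
    for qi in q:
        dists = np.linalg.norm(index_vecs - qi, axis=1)
        for img_idx in np.argsort(dists)[:N]:
            candidates.add(image_to_product[img_idx])
    return candidates

# toy index: 6 images belonging to 3 products
rng = np.random.default_rng(1)
index_vecs = rng.standard_normal((6, 4))
image_to_product = {0: "P1", 1: "P1", 2: "P2", 3: "P2", 4: "P3", 5: "P3"}

q = index_vecs[[0, 2]] + 0.01  # query images near indexed images 0 and 2
C = generate_candidates(q, index_vecs, image_to_product, N=1)
# C == {"P1", "P2"}
```

Deduplicating product IDs here mirrors taking unique product identifiers across all per-image retrievals.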
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Reranking</title>
        <p>The objective of reranking is to ensure that from all candidate items, the ones that are most
visually similar get the highest similarity scores. Typically, neural re-rankers condition the
(1)
(2)
(3)
bulk of their computations on the joint query–indexed image pair, which can be expensive in
practice. We leverage learnings from late-interaction based architecture and adapt an eficient
re-ranker that achieves state-of-the-art for natural language, to our image use-case.</p>
        <p>All images from the candidate products are queried from a low-latency database. These images
are denoted by c, such that c(ij) represents the ith image of the jth product. As the number of images
can vary across products, a default vector V is used to make each product in a batch uniform
in dimensions. If nc(j) represents the number of images for the jth candidate product, then
all products with nc(j) lower than max(nc(j)) among all NC candidates are padded with V.
The maximum number of images across candidate products is represented by nm. The choice of V
can be arbitrary as long as it guarantees the highest distance from any possible image vector.
nm = max(nc(j)) ∀ j ∈ {1, … , NC} (4)</p>
        <p>C(j) = pad(C(j), V, nm − nc(j)) ∀ j ∈ {1, … , NC} (5)</p>
        <p>Crep = concat([C(j)]) ∀ j ∈ {1, … , NC} (6)</p>
        <p>This enables us to marshal each candidate product C(j) in set {C} as a 2-d array of size
(nm, d), where d represents the embedding dimension and is taken to be 512 for our use-case.
Consequently, all C(j) are concatenated to form a 3-d array of size (NC, nm, d) which represents
the corpus of all candidate product representations and is denoted as Crep.</p>
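The padding and stacking described above can be sketched as follows (the pad vector V and the dimensions are illustrative):

```python
import numpy as np

d = 4  # embedding dimension (512 in production; small here)
V = np.full(d, 1e6)  # default pad vector, far from any normalized embedding

def marshal_candidates(cand_images):
    """Pad each candidate's (nc_j, d) image-embedding array to nm rows
    with copies of V, then stack all candidates into the (NC, nm, d)
    corpus Crep."""
    nm = max(c.shape[0] for c in cand_images)
    padded = [np.vstack([c, np.tile(V, (nm - c.shape[0], 1))])
              for c in cand_images]
    return np.stack(padded)

cands = [np.zeros((2, d)), np.zeros((3, d))]  # nc = [2, 3] -> nm = 3
Crep = marshal_candidates(cands)
# Crep.shape == (2, 3, 4); Crep[0, 2] is the pad vector V
```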
        <p>
          A Hadamard product between Q and Crep is performed, and the resultant (NC, nq, nm) array is
denoted as D. Each of the NC matrices with dimensions (nq, nm) represents a similarity matrix
between Q and the corresponding candidate. Next, a MinAvg operator is applied to reduce each
matrix to a singular similarity score. The MinAvg operator, inspired by ColBERT
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] takes a row-wise minimum and averages the result. This is equivalent to identifying the
closest image to each query image, with replacement, calculating the similarity score and averaging
across all query images to get a unified product-level score. The resultant array is denoted as S.
        </p>
        <p>D = Q ⊙ Crep (7)</p>
        <p>S = [MinAvg(D(j))] ∀ j ∈ {1, … , NC} (8)</p>
        <p>Similarity scores can be directly used for reranking the candidates. For our use-case, as we
configured our retrieval index to use Euclidean distance, we also convert the reranking results
from a similarity vector to a vector of Euclidean distances. A summary of the reranking process
is provided in Figure 2.</p>
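A minimal sketch of the MinAvg scoring follows. Since our index uses Euclidean distance, this sketch works directly with distance matrices rather than the Hadamard-product similarity form; the row-wise minimum and average mirror the MinAvg operator:

```python
import numpy as np

def min_avg_scores(Q, Crep):
    """MinAvg scoring: for every candidate, form the (nq, nm) matrix of
    Euclidean distances between query and candidate images, take the
    row-wise minimum (closest candidate image per query image), and
    average over the query images to get one product-level score."""
    diff = Q[None, :, None, :] - Crep[:, None, :, :]  # (NC, nq, nm, d)
    D = np.linalg.norm(diff, axis=-1)                 # (NC, nq, nm)
    return D.min(axis=2).mean(axis=1)                 # (NC,)

Q = np.eye(2, 3)              # nq = 2 query images, d = 3
good = Q.copy()               # candidate identical to the query
bad = np.full((2, 3), 5.0)    # candidate far from the query
scores = min_avg_scores(Q, np.stack([good, bad]))
# scores[0] == 0.0 and scores[0] < scores[1]
```

Because the entire candidate corpus is precomputed, this late-interaction step reduces to a single batched array operation at inference time.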
      </sec>
      <sec id="sec-4-4">
        <title>3.4. System Design</title>
        <p>We provide a high-level overview of the Brand Protection pipeline. Processing millions of
products a day requires the system to be highly efficient, and we therefore choose to use the
popular retrieval–reranking framework, which enables us to perform fast candidate generation
while effectively capturing similarities between groups of images. We have segmented the
design into two sections – indexing and inference. The indexing workflow helps onboard brands
and cover relevant products, whereas the inference pipeline proactively monitors newly set up
listings for brand misrepresentation. Figure 3 shows the design of our system.</p>
        <p>Indexing refers to the process of translating raw product images to dense vector
representations and storing them in a low search-latency vector database. First, we expose a user interface
to the business stakeholders to enable onboarding “trusted” brands to the brand verification
pipeline. Users select brands which are to be onboarded on-demand, and these brand names are
stored in a database. A daily cron job selects any new brands that have been added and triggers
the indexing workflow. All items mapping to the onboarded brands are selected and each of
their images are embedded using CLIP. Product image embeddings are indexed in a high-scale
low-latency vector database that performs ultra-fast similarity matching using an approximate
nearest neighbors search. Additional product metadata required for reranking is saved in a
low-latency NoSQL database.</p>
        <p>Inference is performed on newly set up products and begins with consuming images from
an upstream image stream. Products are read in batches where each product likely has multiple
images. Image embeddings are generated for each product, and each product is now represented
by a bag of embedding vectors. There can be a variable number of images in each bag/product.
We retrieve the closest images to each query image from the index, and map them back to their
product IDs using the metadata database. Unique product IDs are taken as candidate matches for
reranking.</p>
        <p>After an initial retrieval using the IIR system, reranking aims to improve the ranking of the
retrieved images by considering additional features or similarity measures. This process helps
to ensure that the most relevant images are presented at the top of the search results, enhancing
the overall performance and effectiveness of the IIR system. In our adaptation, we perform
reranking at a product level (represented by a bag of image embeddings) rather than single
image level. If the closest reranked candidate is below a preconfigured distance threshold, a
match record gets generated. An operations team reviews the match to determine if it is a truly
misrepresented brand and blocks true positives from the website.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>We describe our experimental setup evaluating various retrieval approaches and their results.
We assess the performance of three approaches on different subsets of e-commerce data with
distinguishing characteristics.</p>
      <sec id="sec-5-1">
        <title>4.1. Data and Preprocessing</title>
        <p>We perform experiments on items from a subset of popular brands. A sample of 1.3M products
is selected, and all unique images are onboarded to a vector database and the relevant metadata
is inserted to a low-latency NoSQL database.</p>
        <p>Tasks. We analyze the performance of SERGI for two tasks – general-purpose image retrieval
and classification.</p>
        <p>Index. Two vector databases are set up for experiments – (i) image-level index containing image
representations to be used for baseline image-to-image match and SERGI ; and (ii) product-level
index with the average of image representations for a product. Both indexes use Euclidean distance
and perform brute-force retrieval, which guarantees finding the exact nearest-neighbor candidates for
experimentation.</p>
        <p>Datasets. Four segments of product data from Walmart are prepared – (i) Eval represents an
unbiased sample of products from the same brands as indexed items; (ii) Eval-C represents a
subset of Eval which belongs to brands whose product listings are known to have low noise
and highly reliable product-to-brand mappings; (iii) Eval-N denotes a subset of Eval which is
known to contain noisy images; and (iv) Eval-Cls is prepared by adding an independent sample
of products from brands which are not indexed to Eval – this dataset is used for evaluating
classification performance using SERGI. Table 2 provides a summary of indexes and datasets.
Representations. CLIP embeddings are used for representing query and indexed images.
Preprocessing. In the context of our work, noise represents signals which lead to increased
similarity between unrelated items, for example swatch images or nutrition labels. In real
e-commerce data, noise is expected, and we therefore do not remove noisy images from our
data. Rather, we explore products with noisy images in more detail so as to assess the robustness of
the studied approaches. Products with single images are removed as they reduce all three
approaches to a trivial image-to-image match and may regress the results. Considering nearly
85% of our listings have multiple images, our inferences generalize to the majority of the catalog.
To avoid potential leakage of the evaluation products into the index, we use a cutoff date for
indexing, such that any items set up before the date get indexed. For validation, we sample
from items that are newly set up after the cutoff date to avoid any leakage.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Evaluation Metrics</title>
        <p>We assess the results of our work from two perspectives – retrieval and classification. While the
primary application for our system is to be used as an IIR system for image groups, it can also
be used for certain classification tasks. Since our business use-case, Brand Protection, is a
high-cardinality classification task, we also report on the classification performance of SERGI. For
measuring retrieval performance, we define the relevance of reranked candidates as 1 if the query and
candidate products are from the same brand:
relevance(j) = 1 if brand(Q) = brand(C(j)), and 0 otherwise, ∀ j ∈ {1, … , NC} (9)</p>
        <p>Three metrics are used to summarize retrieval performance – Precision@K, which is the
ratio of relevant candidates among the retrieved results; Mean Average Precision@K (MAP@K),
which considers the order of the returned relevant candidates and provides higher scores if
relevant candidates are ranked higher; and Mean Reciprocal Rank (MRR), which is the mean
of the multiplicative inverse of the rank of the first relevant candidate. For the classification task
corresponding to the Brand Protection use-case, we report precision, recall and F1 score.</p>
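For a single query's binary relevance list, the three metrics can be computed as below (a minimal sketch under one common convention for AP@K; means over queries yield MAP@K and MRR):

```python
def precision_at_k(rel, k):
    """Fraction of the top-k reranked candidates that are relevant."""
    return sum(rel[:k]) / k

def average_precision_at_k(rel, k):
    """Average of precision@i over ranks i where a relevant candidate
    appears within the top k (normalized by the number of hits -- one
    common convention); averaging over queries gives MAP@K."""
    hits, total = 0, 0.0
    for i, r in enumerate(rel[:k], start=1):
        if r:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def reciprocal_rank(rel):
    """1 / rank of the first relevant candidate (0 if none); the mean
    over queries is MRR."""
    for i, r in enumerate(rel, start=1):
        if r:
            return 1.0 / i
    return 0.0

rel = [1, 0, 1, 0]  # binary relevance of reranked candidates
# precision_at_k(rel, 4) == 0.5; reciprocal_rank(rel) == 1.0
```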
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Comparable Approaches</title>
        <p>As the application of grouped image retrieval is relatively unexplored, we use two alternative
approaches for comparison.</p>
        <p>Image to Image Match (I2I ). An image-to-image match system is initially implemented as
a baseline. Given a group of images representing a query product, we perform top-K (K=20)
retrieval across each query image. The resulting candidate images are reranked based on the
Euclidean distance from their closest query image. We observe that while the baseline performs
reasonably well on Eval, it lacks robustness to noisy and generic product images. This is
expected because this approach prioritizes the closest image from a group of product images,
which are often generic. I2I works well if we can identify and exclude generic images, but that
is often not viable with complex and large-scale catalogs that take inputs from multiple internal
sources as well as third party sellers. Some examples of such matches are provided in Figure 4.
Therefore, we formulate an approach that is robust to generic images.</p>
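A minimal sketch of the I2I baseline's reranking step (helper name hypothetical):

```python
import numpy as np

def i2i_rerank(q, cand_imgs):
    """I2I baseline: each retrieved candidate image is scored by its
    Euclidean distance to the closest query image, and candidates are
    sorted by ascending distance."""
    scores = [float(np.linalg.norm(q - c, axis=1).min()) for c in cand_imgs]
    order = np.argsort(scores)
    return order, scores

q = np.array([[0.0, 0.0], [1.0, 1.0]])            # two query images
cands = [np.array([0.9, 1.1]), np.array([5.0, 5.0])]
order, scores = i2i_rerank(q, cands)
# the near-duplicate of q[1] ranks first: order[0] == 0
```

Because a single generic image can dominate the minimum, this scoring is exactly what makes I2I fragile on noisy catalogs, as discussed above.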
        <p>Representation-focused learning (REP). Inspired by representation-focused rankers in
information retrieval literature, we construct a consolidated representation for products to be
indexed. For each image associated with a product, we generate image embedding based on the
CLIP model. We then calculate the element-wise mean across all images to build consolidated
product embeddings. This technique mirrors the Average Query Expansion (AQE) [22] strategy,
which involves modifying the query by averaging representations of the top retrieved images.</p>
        <p>These embeddings are incorporated into the product-level index. During inference a
consolidated representation of the query product is generated in a similar manner. The final retrieval
step is conducted using the product-level index and the consolidated representation of the query
product.</p>
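A sketch of the REP consolidation step (re-normalizing the averaged embedding is our assumption; the text specifies only the element-wise mean):

```python
import numpy as np

def consolidated_embedding(image_embs):
    """REP baseline: element-wise mean of a product's image embeddings,
    re-normalized (assumption), stored in the product-level index."""
    mean = np.asarray(image_embs).mean(axis=0)
    return mean / np.linalg.norm(mean)

embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # two toy image embeddings
rep = consolidated_embedding(embs)
# rep is the unit vector along [1, 1]
```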
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Retrieval Performance Comparison</title>
        <p>We compare the performance of the three approaches on three datasets. For the Eval data, we see that
SERGI performs slightly better than the I2I baseline. We hypothesize that the lack of meaningful
lift is because the Eval data is reasonably clean and has a low generic-image count. The difference
between SERGI and I2I becomes more pronounced on the Eval-C data. This is because Eval-C
contains products from a set of brands which have highly pronounced characteristics. Moreover,
the difference between I2I and REP diminishes because the average product representations for REP
become more discriminative. Each approach sees a significant lift when compared with Eval.</p>
        <p>The performance of each approach drops when using the Eval-N data due to the presence of
generic images that lead to lower quality retrieval. SERGI significantly outperforms the other
two approaches, especially at lower values of K, which is important for the classification
use-case. Retrieval results are summarized in Table 3.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. The Classification Use-case</title>
        <p>Image retrieval systems have been explored in context of image classification by various studies.
Some studies explore a unified framework for both tasks, highlighting that both involve
measuring the similarity between the query and training or candidate images [23]. We leverage SERGI
for classification for our Brand Protection use-case. The primary objective is to classify whether
a newly set up product from a lesser-known brand actually belongs to an indexed popular brand.
Consider that a candidate product belongs to a brand Bs. To adapt the retrieval results for
classification, we tune a distance threshold T such that if the Euclidean distance between the query
and a candidate product is below T then the query product is classified as belonging to brand
Bs.</p>
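The thresholding step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the recall-targeted tuning routine are our own assumptions, with distances taken to be pre-normalized to [0, 1].

```python
import numpy as np

def classify_brand(query_dist, T):
    """Classify the query product as belonging to the candidate's brand Bs
    when its normalized retrieval distance falls below the tuned threshold T."""
    return query_dist < T

def tune_threshold(dists, labels, recall_target=0.7):
    """Pick the smallest threshold T reaching a recall target on a labeled
    validation set. `dists` are normalized query-candidate distances and
    `labels` are 1 when the query truly belongs to the candidate's brand."""
    order = np.argsort(dists)
    dists = np.asarray(dists, dtype=float)[order]
    labels = np.asarray(labels)[order]
    recall = np.cumsum(labels) / labels.sum()  # recall if we accept up to each distance
    idx = np.searchsorted(recall, recall_target)
    return dists[min(idx, len(dists) - 1)]
```

Lowering the recall target yields a smaller, more restrictive T, which is how a production recall requirement (e.g. 70% versus 50%) translates into a threshold choice.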
        <p>The precision-recall (PR) characteristics of the three approaches are illustrated in Figure
5. Given the non-uniform range of Euclidean distance, it is normalized between 0 and 1 for
consistent representation. Our classification results on the Eval-Cls data are presented in Table 4.
Of the three, SERGI demonstrates the largest area under the PR curve, followed by I2I and
REP. We observe that while SERGI significantly outperforms the other methods in low-recall
regions, its precision is comparable to I2I at higher recall values. This implies that when
a lower, more restrictive T is selected, SERGI is expected to be relatively more precise, whereas
for use-cases requiring higher recall, the benefits of SERGI tend to wane.</p>
        <p>The choice of nr, and indirectly nm, can also influence the shape of the PR curve. Additionally,
using the MinSum operator instead of the currently used MinAvg can reduce SERGI to I2I. We plan
to analyze the influence of these parameters on the PR characteristics through ablation studies
in the future. Although we report results at a recall threshold of 70%, the long-term production
target, our initial deployment prioritizes higher precision to foster stakeholder confidence.
Consequently, we reduce our recall requirement to 50%, which is expected to enhance precision
to 0.735.</p>
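A minimal sketch of the two aggregation operators, under the assumption that both take each query image's distance to its closest candidate image (analogous to ColBERT-style late interaction, with distances in place of similarities), with MinSum summing those minima and MinAvg averaging them; the helper names are ours, not from the paper:

```python
import numpy as np

def pairwise_min_dists(query_embs, cand_embs):
    """For each query image embedding, the Euclidean distance
    to its closest candidate image embedding."""
    d = np.linalg.norm(query_embs[:, None, :] - cand_embs[None, :, :], axis=-1)
    return d.min(axis=1)

def min_sum(query_embs, cand_embs):
    # Sum of per-query-image minimum distances.
    return pairwise_min_dists(query_embs, cand_embs).sum()

def min_avg(query_embs, cand_embs):
    # Average of per-query-image minimum distances.
    return pairwise_min_dists(query_embs, cand_embs).mean()
```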
      </sec>
      <sec id="sec-5-6">
        <title>4.6. Deployment</title>
        <p>In this section, we delineate the deployment of the SERGI system for the purpose of Brand
Protection and subsequently discuss the results post-implementation. The SERGI system
actively monitored new products from a selection of lesser-known brands. If a monitored
product corresponded with an indexed product from a reputable and trusted brand, it was
earmarked for manual review. Our business stakeholders conducted an individual review of
each flagged product to ascertain the accuracy of the match, specifically, whether the highlighted
item truly belonged to the suggested trusted brand. The outcomes from these manual reviews
are subsequently reported.</p>
        <p>Progressive deployment. In the initial phase, we indexed products from a limited range of
fewer than 100 brands. The selection of these brands was based on internally curated databases
to ensure the inclusion of popular and reliable brands. To ensure the integrity of our database,
we used a combination of historical claims data, manual brand curation, and data from gated
brands to select a clean set of brands. As we began receiving reviews, we identified patterns
of false positives, which we promptly addressed. After a meticulous two-month period of
monitoring and active manual reviews, we deemed it appropriate to expand the system to
include additional brands. For onboarding these additional brands, we offer two options
– ad-hoc onboarding of a few brands through a user interface, which helps take prompt action
for brands with recent IP-related escalations; and a bootstrapping process to onboard thousands of
brands and achieve scale.</p>
        <p>Prioritizing brands for onboarding. As there are millions of brands in the catalog, it is
imperative to filter trustworthy and reliable brands for indexing. Since it is impractical to
manually annotate each brand as fit or unfit, we exploit LLMs’ parametric memory and general
understanding of natural language, and hence of brand popularity, to shortlist suitable brands. A
set of 20,000 brands is initially shortlisted based on business-provided criteria, and a sample
from these brands is passed through an LLM-powered chat interface. A suitable prompt is tuned,
which takes batches of 25 randomly selected brands and provides a relative ranking of each brand
based on the model’s general understanding of the brand’s reach, recognition, and usage among consumers.
A few batches are sent for manual review by the business, and after confirming the reliability of
the results, all 20,000 brands are passed to the LLM. We select a rank threshold (of 10) based
on the manually reviewed batches, and all brands with rank equal to or below this threshold are
shortlisted for onboarding.</p>
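The batching and thresholding logic can be sketched as below. Everything here is illustrative: `rank_batch` stands in for the LLM call (the paper uses a chat interface with a tuned prompt, whose details are not shown), and the function names are ours.

```python
import random

BATCH_SIZE = 25       # brands ranked relative to each other per LLM call
RANK_THRESHOLD = 10   # tuned on manually reviewed batches

def shortlist_brands(brands, rank_batch, seed=0):
    """Split brands into random batches, obtain a relative ranking for each
    batch via `rank_batch` (a caller-supplied wrapper around an LLM call
    returning {brand: rank} for the batch), and keep brands ranked at or
    below the threshold."""
    rng = random.Random(seed)
    brands = list(brands)
    rng.shuffle(brands)
    shortlisted = []
    for i in range(0, len(brands), BATCH_SIZE):
        batch = brands[i:i + BATCH_SIZE]
        ranks = rank_batch(batch)
        shortlisted.extend(b for b in batch if ranks[b] <= RANK_THRESHOLD)
    return shortlisted
```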
        <p>Filtering noisy products. Although our method proved more robust to noisy and generic
images, we observed a few instances where items escaped the system’s safeguards. For
instance, products containing only swatch images, or only plain white
blocks, led to a high volume of false matches. We developed a heuristic-based approach to
coarsely detect and filter such images. Filtering is performed at two stages – during indexing,
we remove noisy images; and post-inference, we exclude products whose primary image
is noisy. We plan to undertake an ablation study to optimize the placement of this filtering in
the future.</p>
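One plausible heuristic of this kind, offered purely as an illustration (the paper does not specify its rules or thresholds), flags near-uniform color blocks and almost entirely white frames:

```python
import numpy as np

def is_noisy_image(img, std_thresh=8.0, white_thresh=0.98):
    """Coarse heuristic for 'generic' images such as bare swatches or
    plain white blocks. `img` is an HxWx3 uint8 array; both thresholds
    are illustrative, not from the paper."""
    img = np.asarray(img, dtype=np.float32)
    if img.std() < std_thresh:                    # near-uniform color block
        return True
    white_frac = (img > 245).all(axis=-1).mean()  # fraction of near-white pixels
    return white_frac > white_thresh              # almost entirely white frame

# Applied at two stages: during indexing (drop noisy images before
# building the index) and post-inference (drop products whose primary
# image is noisy).
```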
        <p>Results from release. During the initial two-month activation period, the system identified
1,424 products requiring review, of which 1,399 were subsequently assessed. Of these, 1,062
were confirmed as true positives, indicating a precision of 0.759. This was
fairly close to the expected precision of 0.735 at 50% recall from our experiments. The minor
difference between expected and observed precision may be attributed to the difference in brand
distribution between experiments and deployment, and to the use of post-inference filters.</p>
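The reported precision follows directly from the review counts:

```python
# Counts from the two-month activation period.
flagged, reviewed, true_positives = 1424, 1399, 1062
precision = true_positives / reviewed  # true positives over assessed products
print(round(precision, 3))  # 0.759
```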
        <p>The true positives represented products inaccurately listed under incorrect brand names,
adversely impacting customer experience. As a result, the review team promptly blocked these
products while notifying the marketplace sellers. If the sellers update their listings with the
accurate brand name, the product is then eligible for republishing. While we outline a specific
use-case for detecting incorrect brands, learnings from our work can be generalized to other
use-cases which require finding similar entities to a query entity.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>In this study, we investigated an array of strategies for entity retrieval rooted in
image groupings. Drawing inspiration from the late-interaction architecture prevalent in the
natural language literature, we introduced SERGI, a method designed for precise and highly efficient
retrieval of entities similar to given image groups. This paper also showcases an internal
application of our system tailored for fine-grained visual categorization, demonstrating its
adaptation to tasks involving high-cardinality classification. The novel concept of grouped
entity retrieval, along with the application of SERGI, holds significant potential for a wide
range of other domains.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Acknowledgements</title>
      <p>We would like to thank the entire Walmart Item &amp; Inventory data science team for their
contributions to this project. We thank Sankalp Jain, Elizabeth Snukst, Binwei Yang, Brian
Seaman and Prakhar Mehrotra for their unwavering support and review for the paper. We
are grateful to Mohammad Qurashi for his guidance on understanding the business problem,
data landscape and working with stakeholders to identify key business opportunities. We
thank Dion Tang, Robert Caya, Kendel Jackson and William Zuniga for proactively providing
business feedback, annotations and reviews that helped with iterative enhancements of the
deployed system. We thank Arup Das for his help in detecting generic and noisy images used for
experimentation. We thank Ishan Bhatt, Yi-Sheng Yang and Quoc Tran for helpful discussions
and suggestions about addressing common catalog imagery challenges.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I. M.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Abdulhussain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Mahmmod</surname>
          </string-name>
          ,
          <article-title>Content-based image retrieval: A review of recent trends</article-title>
          ,
          <source>Cogent Engineering</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>1927469</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Colbert: Efficient and effective passage search via contextualized late interaction over bert</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kurt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Özkan</surname>
          </string-name>
          ,
          <article-title>An image-based recommender system based on feature extraction techniques</article-title>
          , in: 2017 International Conference on Computer Science and Engineering (UBMK), IEEE,
          <year>2017</year>
          , pp.
          <fpage>769</fpage>
          -
          <lpage>774</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Shamna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Govindan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Nazeer</surname>
          </string-name>
          ,
          <article-title>Content based medical image retrieval using topic and location model</article-title>
          ,
          <source>Journal of biomedical informatics 91</source>
          (
          <year>2019</year>
          )
          <fpage>103112</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kalra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Tizhoosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Diamandis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Campbell</surname>
          </string-name>
          , L. Pantanowitz,
          <article-title>Yottixel-an image search engine for large archives of histopathology whole slide images</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <fpage>101757</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Image retrieval from remote sensing big data: A survey</article-title>
          ,
          <source>Information Fusion</source>
          <volume>67</volume>
          (
          <year>2021</year>
          )
          <fpage>94</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Smelyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sandrkin</surname>
          </string-name>
          , I. Ruban,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vitalii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Romanenkov</surname>
          </string-name>
          ,
          <article-title>Search by image. new search engine service model</article-title>
          , in: 2018 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology (PIC S&amp;T), IEEE,
          <year>2018</year>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <article-title>A decade survey of content based image retrieval using deep learning</article-title>
          ,
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>32</volume>
          (
          <year>2021</year>
          )
          <fpage>2687</fpage>
          -
          <lpage>2704</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>O’Mahony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harapanahalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Krpalkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Riordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walsh</surname>
          </string-name>
          ,
          <article-title>Deep learning vs. traditional computer vision</article-title>
          , in: Advances in
          <source>Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1 1</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , Efficientnet:
          <article-title>Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Learned-Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing</article-title>
          ,
          <source>in: Proceedings of the 27th ACM international conference on information and knowledge management</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <article-title>Learning to match using local and distributed representations of text for web search</article-title>
          ,
          <source>in: Proceedings of the 26th international conference on world wide web</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1291</fpage>
          -
          <lpage>1299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] S. K. Addagarla, A. Amalanathan, <article-title>Probabilistic unsupervised machine learning approach for a similar image recommender system for e-commerce</article-title>, <source>Symmetry</source> <volume>12</volume> (<year>2020</year>) <fpage>1783</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] T. Stanley, N. Vanjara, Y. Pan, E. Pirogova, S. Chakraborty, A. Chaudhuri, <article-title>Sir: Similar image retrieval for product search in e-commerce</article-title>, <source>in: Similarity Search and Applications: 13th International Conference, SISAP 2020, Copenhagen, Denmark, September 30–October 2, 2020, Proceedings 13</source>, Springer, <year>2020</year>, pp. <fpage>338</fpage>-<lpage>351</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] A. Baldrati, M. Bertini, T. Uricchio, A. Del Bimbo, <article-title>Conditioned and composed image retrieval combining and partially fine-tuning clip-based features</article-title>, <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <year>2022</year>, pp. <fpage>4959</fpage>-<lpage>4968</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] M. Hendriksen, M. Bleeker, S. Vakulenko, N. Van Noord, E. Kuiper, M. De Rijke, <article-title>Extending clip for category-to-image retrieval in e-commerce</article-title>, <source>in: European Conference on Information Retrieval</source>, Springer, <year>2022</year>, pp. <fpage>289</fpage>-<lpage>303</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, C. P. Langlotz, <article-title>Contrastive learning of medical visual representations from paired images and text</article-title>, <source>in: Machine Learning for Healthcare Conference, PMLR</source>, <year>2022</year>, pp. <fpage>2</fpage>-<lpage>25</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, <article-title>Total recall: Automatic query expansion with a generative feature model for object retrieval</article-title>, <source>in: 2007 IEEE 11th International Conference on Computer Vision</source>, IEEE, <year>2007</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] L. Xie, R. Hong, B. Zhang, Q. Tian, <article-title>Image classification and retrieval are one</article-title>, <source>in: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval</source>, <year>2015</year>, pp. <fpage>3</fpage>-<lpage>10</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>