<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM SIGIR Workshop on eCommerce, July</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Shopping Queries Image Dataset (SQID): An Image-Enriched ESCI Dataset for Exploring Multimodal Learning in Product Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marie Al Ghossein</string-name>
          <email>marie.alghossein@crossingminds.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ching-Wei Chen</string-name>
          <email>chingwei.chen@crossingminds.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Tang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Information Retrieval, Product Search, Multimodal Learning, eCommerce</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Crossing Minds</institution>
          ,
          <addr-line>San Francisco, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Stripe</institution>
          ,
          <addr-line>Toronto, ON</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>18</volume>
      <issue>2024</issue>
      <fpage>2</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Recent advances in the fields of Information Retrieval and Machine Learning have focused on improving the performance of search engines to enhance the user experience, especially in the world of online shopping. The focus has thus been on leveraging cutting-edge learning techniques and relying on large enriched datasets. This paper introduces the Shopping Queries Image Dataset (SQID), an extension of the Amazon Shopping Queries Dataset enriched with image information associated with 190,000 products. By integrating visual information, SQID facilitates research around multimodal learning techniques that can take into account both textual and visual information for improving product search and ranking. We also provide experimental results leveraging SQID and pretrained models, showing the value of using multimodal data for search and ranking. SQID is available at https://github.com/Crossing-Minds/shopping-queries-image-dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Product Search</kwd>
        <kwd>Multimodal Learning</kwd>
        <kwd>eCommerce</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>In the age of online shopping, eCommerce platforms must help customers find what they are
looking for with the least amount of effort. Product search allows users to enter a search query,
and get back a list of results matching that query. An effective product search should be able
to understand exactly what a user is looking for, and retrieve the most relevant results from a
catalog of available items. To effectively fulfill a user’s shopping needs, a search engine must
draw on all the information it has available, including textual, visual, and contextual metadata
associated with the user, the search query, and the products in the catalog.</p>
      <p>In particular, visual information can be very useful to identify characteristics of products that
may not be well represented in textual metadata. To illustrate this point, consider the product
listing for a men’s dress shirt1, which includes textual metadata such as:
• Description: “This Stylish Men’s Collared Dress Shirt Comes in a Modern Fit Which
is Slightly More of a Tailored Fit Than a Regular Fit. It Also Features Slim Fit, Vertical
Striped Printed Pattern, Buttoned Up Closure, Turn Down Collar, Single Breasted Buttons,
Convertible Double French Cuff, Round Curved Shirttail Hem”
• Size Options: Small, Medium, Large, Extra Large
• Color Options: Black, Blue, Navy Blue White, Khaki, White Stripe Black, White/Purple
Stripe, Grey Plaid</p>
      <p>If a user is looking for a “men’s dress shirt with thin vertical stripes”, they might expect that
this product is a relevant match based solely on the textual metadata. However, when looking
at the product images, they would quickly notice that the stripe pattern on the shirt is not “thin”
but rather “thick” stripes. Not only that, but many of the different color options in fact have a
completely different design and thickness of stripes, while some color options have a checkered
pattern instead of stripes (Figure 1).</p>
      <p>Figure 1: (a) A striped shirt; (b) the same shirt, color “Khaki”; (c) the same shirt, color “Gray Plaid”.</p>
      <p>None of these options would be a great match for the “thin vertical stripes” the user is looking
for. However, search engines that rely only on textual metadata are likely to return these shirts
as relevant results. If, on the other hand, the search engine leveraged multimodal information
such as the product images, it might not have made that mistake.</p>
      <p>
        In order to support research on improving product search by leveraging image
information, we are releasing the Shopping Queries Image Dataset (SQID), an augmented
version of the Amazon Shopping Queries Dataset (SQD)2 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that includes image information
and visual embeddings for over 190,000 products, as well as text embeddings of associated
search queries, so that researchers can explore the effects of multimodal learning on the
effectiveness of product search. The dataset is available at https://github.com/Crossing-Minds/shopping-queries-image-dataset and on Hugging Face at https://huggingface.co/datasets/crossingminds/shopping-queries-image-dataset.
      </p>
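      <p>As a quick-start illustration, the snippet below sketches how the dataset might be loaded from the Hugging Face Hub with the datasets library. This is a minimal sketch: the dataset id is taken from the URL above, and the exact configuration and column names should be checked on the dataset page.</p>
      <preformat>
# Minimal sketch (assumptions noted above): load SQID from the Hugging Face Hub.
from datasets import load_dataset

# Dataset id taken from the URL above; the default configuration is assumed here.
sqid = load_dataset("crossingminds/shopping-queries-image-dataset")
print(sqid)  # inspect the available splits and columns
      </preformat>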
      <p>The paper is structured as follows. Section 2 presents related work around SQD and pretrained
models used to embed multimodal data. Section 3 provides the details of the data covered in SQID,
as well as the methodology followed for data collection. Section 4 presents the experimental
setting used in this paper to highlight the benefit of using multimodal data for ranking, followed
by the experimental results provided in Section 5.
2https://github.com/amazon-science/esci-data</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>This section covers, on one hand, work related to SQD, on which SQID is built, and, on the
other hand, multimodal learning techniques that leverage image and text data to represent items
and products.</p>
      <sec id="sec-3-1">
        <title>2.1. Shopping Queries Dataset (SQD)</title>
        <p>
          In 2022, Amazon released the Shopping Queries Dataset (SQD) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], as part of the KDD Cup
challenge. This dataset includes a large number of product search queries from real Amazon
users, along with a list of up to 40 potentially relevant results for each query. Each of these
results comes with a judgment of how relevant the product is to the search query. These
judgments (E, S, C, and I) are described on the KDD Cup’22 Challenge Page3 and correspond to
Exact (E), Substitute (S), Complement (C), and Irrelevant (I) (see more details in Table 1).
The dataset was released along with three tasks4:
• Task 1 - Query-Product Ranking: Given a query and a set of retrieved products for this
query, the goal is to rank the products going from the most relevant to the least relevant,
similar to the output of a search engine.
• Task 2 - Multi-class Product Classification: Given a query and a set of retrieved products
for this query, the goal is to classify each product as part of the E, S, C, and I classes of
products.
• Task 3 - Product Substitute Identification: Given a query and a set of retrieved products
for this query, the goal is to identify the substitute products from the list of retrieved
products.
        </p>
        <p>
          In the context of this challenge, a variety of techniques were explored to improve the score
for each of the tasks, including self-distillation, data augmentation, and adversarial training,
among others [
          <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
          ].
        </p>
        <p>
          SQD has also been used to support other use cases. For instance, Tang et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] generate textual
product descriptions based on product images, use them to improve search and recommendation,
and evaluate their approach on Task 1 of the ESCI dataset. Hou et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] introduce a set
of pretrained sentence embedding models for recommendation, trained on the “Amazon
Reviews 2023” dataset5, a dataset including user reviews and item metadata from Amazon. The
ESCI dataset is used to evaluate the performance of these models for conventional product
search.
3https://www.aicrowd.com/challenges/esci-challenge-for-improving-product-search
4https://github.com/amazon-science/esci-data?tab=readme-ov-file#introduction
        </p>
        <p>
          On another note, the TREC Product Search Track of 2023 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] leveraged SQD to create a
benchmark of retrieval methods used for product search. The dataset was enriched with
multimodal data and additional evaluation queries and labels, to make it more suitable for an
end-to-end retrieval benchmark rather than a ranking task. Compared to this work, our focus
is more aligned with the initial ranking task of the KDD Cup’22. We also document the details
of data collection and release textual and visual embeddings as well as experimental results
comparable to the ESCI benchmark [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Multimodal Pretrained Models</title>
        <p>
          Multimodal pretrained models have emerged as a powerful paradigm for learning joint
representations that capture the relationships between different types of data such as images, text, and
audio. In particular, Contrastive Language-Image Pre-training (CLIP) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] relies on a
transformer-based architecture, is trained using a contrastive learning approach, and learns to associate
images with corresponding textual data by maximizing the similarity between matching image-text pairs
and minimizing it otherwise.
        </p>
        <p>
          Several extensions of multimodal models have been proposed to address item retrieval
and ranking problems in the e-commerce domain, among other applications, and to take into account
characteristics of user behavior. One such approach is the CLIP-ITA model [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which addresses
the category-to-image retrieval task in e-commerce by leveraging textual, visual, and attribute
modalities to enhance product representations and improve retrieval performance. Another
notable approach involves conditioned and composed image retrieval based on CLIP features,
where an image is combined with a text that provides information about user intentions [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>In this paper, we rely on CLIP to embed queries and products based on text and image data.
While fine-tuning pretrained models on a dataset specific to the task is very beneficial to improve
the performance, we consider it outside of the scope of this paper and only use pretrained
models in our experiments (more details in Section 5).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Shopping Queries Image Dataset (SQID)</title>
      <sec id="sec-4-1">
        <title>3.1. Data Characteristics</title>
        <p>The Shopping Queries Image Dataset (SQID) builds upon SQD by including image information
and visual embeddings for each product, as well as text embeddings for the associated queries
which can be used for baseline product ranking benchmarking. The image information can be
used to enhance or improve the accuracy of product search algorithms by allowing them to
leverage multimodal machine learning techniques.</p>
        <p>
          The image information included in this dataset consists of:
1. Image URL
2. Image embeddings extracted using a CLIP model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], specifically clip-vit-large-patch14 6
The original SQD includes two subsets of data: a reduced set (“small_version” = 1), used
for Task 1 Query-Product Ranking, and a larger set (“large_version” = 1), used for Tasks 2 and
3. The queries also come from 3 different locales: “us”, “es”, and “jp”. Due to the complexity of
collecting data, we limited the scope of this dataset to the following subset of SQD (a filtering
sketch follows below):
• “small_version” = 1 (reduced set)
• “product_locale” = “us”
        </p>
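        <p>As a minimal sketch of this filtering step, assuming the parquet file layout and column names of the public esci-data release (the exact paths may differ in a local copy):</p>
        <preformat>
# Sketch: reproduce the SQID scope from SQD (assumed esci-data file and column names).
import pandas as pd

examples = pd.read_parquet(
    "shopping_queries_dataset/shopping_queries_dataset_examples.parquet")

# Reduced set used for Task 1, restricted to the US locale.
subset = examples[examples["small_version"] == 1]
subset = subset[subset["product_locale"] == "us"]

test_products = subset[subset["split"] == "test"]["product_id"].unique()
print(len(subset), "judgements;", len(test_products), "unique products in the test split")
        </preformat>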
        <p>The reduced set consists of 1,118,011 &lt;query, rating&gt; judgements, out of which 601,354 are
from locale “us”. These judgments contain references to 482,105 unique products (with a unique
product_id).</p>
        <p>We then mainly focus on the products found in the test set of SQD’s Task 1 (i.e., having
“split” = “test”). The total number of products appearing there is 181,701, out of which 164,900
are unique. While the rest of the paper focuses on this set of products, SQID also includes
supplementary data, covering additional products appearing in at least 2 query judgements in
the “us” locale subset of Task 1. There are 27,139 unique products that meet this criterion and
are not in the test split.</p>
        <p>Overall, therefore, SQID covers 164,900 products, with a supplementary part covering an
additional set of 27,139 products.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Data Collection</title>
        <p>Image URLs. We scraped the Amazon website to retrieve the URL to the main product image
displayed on the product page of 164,900 products, resulting in 156,545 product_id’s having an
image URL (95%). We focused on the following domains, attempting to retrieve product pages
from each of these successively: .com, .ca, .com.au, .cn, .fr, .de, and .co.jp. There are two main
cases for when a product does not have an image URL:
• The product_id failed to return a valid product page, usually when the product is no
longer offered on Amazon, or
• There was no image associated with the product - to be precise, the main image of the
product is a blank image that says “No image available”.</p>
        <p>There are 442 products where the image URL contains this particular URL:
https://m.media-amazon.com/images/G/01/digital/video/web/Default_Background_Art_LTR._SX1080_FMjp_.jpg.
These are “generic” product images for digital video products where there is no product-specific
image.</p>
        <p>Textual and visual embeddings. In addition to product image URLs, SQID also
includes visual and textual embeddings of products. These were obtained using a pretrained CLIP
model, specifically clip-vit-large-patch14 7, based on product image URLs and product titles.
To address the product ranking task, we also include query embeddings obtained based on the
query text.
6https://huggingface.co/openai/clip-vit-large-patch14
7https://huggingface.co/openai/clip-vit-large-patch14</p>
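        <p>The following sketch illustrates how such embeddings can be extracted with the Hugging Face transformers implementation of clip-vit-large-patch14; the image URL and product title below are placeholders, and batching and error handling are omitted.</p>
        <preformat>
# Sketch: CLIP visual and textual embeddings for one product (placeholders below).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image_url = "https://example.com/product-image.jpg"   # placeholder: use an image URL from SQID
product_title = "Stylish Men's Collared Dress Shirt"  # placeholder product title

image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
inputs = processor(text=[product_title], images=image,
                   return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    image_embedding = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_embedding = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
        </preformat>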
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Evaluation</title>
      <p>
        In order to illustrate the value of using multimodal data for product ranking, we leverage SQID
for Task 1 of the KDD Cup 2022, which consists of query-product ranking [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>We evaluate the performance of several ranking approaches for the Task 1 (“small_version” =
1) on the test set (“split” = “test”) and for the US locale (“product_locale” = “us”). The evaluation
dataset consists of 181,701 judgements, 8,956 queries, and 164,900 products. The average number
of judgements per query is around 20.</p>
      <p>
        We only rely on pretrained models and consider that fine-tuning models on the ESCI training
data as well as other more advanced techniques used by winning solutions of the challenge
(e.g., [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) are outside the scope of this paper and are left for future work.
      </p>
      <sec id="sec-5-1">
        <title>4.1. Metrics</title>
        <p>
          Following the setting of the challenge, the ranking quality is measured using the Normalized
Discounted Cumulative Gain (NDCG) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The four degrees of relevance of a product to a query,
defined by the labels E (Exact), S (Substitute), C (Complement), and I (Irrelevant), are attributed
respectively to the following relevance scores: 1.0, 0.1, 0.01, and 0.0. To ensure reproducibility
and follow the same guidelines as the ESCI benchmark8, we use the Terrier IR platform9 to
compute NDCG.
        </p>
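        <p>For clarity, the gain assignment can be made explicit with a small, illustrative NDCG computation; note that the reported results are computed with the Terrier IR platform, not with this sketch.</p>
        <preformat>
# Illustrative NDCG with the ESCI gain mapping (E=1.0, S=0.1, C=0.01, I=0.0).
import numpy as np

GAINS = {"E": 1.0, "S": 0.1, "C": 0.01, "I": 0.0}

def ndcg(labels_in_ranked_order):
    gains = np.array([GAINS[label] for label in labels_in_ranked_order])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))
    ideal_dcg = float(np.sum(np.sort(gains)[::-1] * discounts))
    return dcg / ideal_dcg if ideal_dcg else 0.0

print(ndcg(["E", "S", "I", "E"]))  # NDCG of one query's ranked judgement list
        </preformat>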
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Ranking Approaches</title>
        <p>We first include in our evaluation two baselines, for reference and to allow comparison with the
main ranking approaches considered.</p>
        <p>
          Random baseline. The random baseline is included to provide a lower bound of NDCG for
the ranking task considered, and consists of randomly ranking the products for each query.
ESCI_baseline. The ESCI_baseline is the standard baseline introduced in the initial ESCI
benchmark [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For the “us” locale subset, it uses an MS MARCO Cross-Encoder10, a Sentence
Transformer model [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] trained on the MS MARCO Passage Ranking task11. The model is further
fine-tuned on the training set of SQD. The query and product title are used as input for the model.
The approaches evaluated in this paper and introduced below all follow the same core
methodology for ranking: cosine similarity is used to measure the relevance of a product to a
query, and products are then ranked in decreasing order of similarity. The main difference lies
in the models and data used to embed queries and products, which are used to compute the similarity.
SBERT_text. We use all-MiniLM-L12-v212, a Sentence Transformers model [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], to embed
queries and products. The query text and product title are used as input for the model.
CLIP_text. We use CLIP [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], specifically clip-vit-large-patch14 13, to embed queries and
product titles. While the model is not specifically optimized for handling text alone, it enables
the representation of text and images in the same space, which is required by some of the
approaches considered here.
8https://github.com/amazon-science/esci-data/
9https://github.com/terrier-org/terrier-core/blob/5.x/doc/index.md
10https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2
11https://github.com/microsoft/MSMARCO-Passage-Ranking
12https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
13https://huggingface.co/openai/clip-vit-large-patch14
        </p>
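        <p>As an illustration of the embedding step for SBERT_text (CLIP_text follows the same pattern with the CLIP text encoder), here is a minimal sketch using the sentence-transformers library; the query and title strings are taken from the example in the introduction.</p>
        <preformat>
# Sketch: SBERT_text embeddings for a query and a product title.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
query_embedding = model.encode("men's dress shirt with thin vertical stripes",
                               normalize_embeddings=True)
title_embedding = model.encode("Stylish Men's Collared Dress Shirt",
                               normalize_embeddings=True)
        </preformat>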
        <p>
          CLIP_image. We use CLIP [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], specifically clip-vit-large-patch14, to embed queries and
product images.
        </p>
        <p>We also consider ranking approaches that combine both product text and images. This is done
by either combining query-product similarities or directly combining ranking lists, using a
weighted average. These approaches are designated by the notation A1_comb_A2, where A1
and A2 are the two approaches combined, and comb is the method used to combine the results
from A1 and A2 (rank when combining rankings and score when combining scores). A weight
parameter w is used to balance the impact of text versus images. In terms of notation, w
is associated with A1 and (1 − w) with A2.</p>
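        <p>A minimal sketch of this shared ranking scheme and of the two combination methods follows; the helper names are illustrative, and the embedding matrices are assumed to be L2-normalized so that dot products equal cosine similarities.</p>
        <preformat>
# Sketch: cosine-similarity ranking and score/rank combination with weight w.
import numpy as np

def similarity_scores(query_embedding, product_embeddings):
    # Cosine similarity, assuming L2-normalized embeddings.
    return product_embeddings @ query_embedding

def combine_scores(scores_1, scores_2, w):
    # "score" combination: weighted average of query-product similarities.
    return w * scores_1 + (1.0 - w) * scores_2

def combine_ranks(scores_1, scores_2, w):
    # "rank" combination: weighted average of rank positions (0 = most relevant).
    ranks_1 = np.argsort(np.argsort(-scores_1))
    ranks_2 = np.argsort(np.argsort(-scores_2))
    return -(w * ranks_1 + (1.0 - w) * ranks_2)  # negate so higher is better

def rank_products(final_scores):
    # Products ranked in decreasing order of the (possibly combined) score.
    return np.argsort(-final_scores)
        </preformat>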
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experimental Results</title>
      <p>
        Using the Terrier IR platform to compute NDCG for the  _ leads to an NDCG of
0.83, as reported in the ESCI benchmark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, we noticed that the mapping of labels
to relevance scores is incorrectly swapped for labels S and C in the code released (see line 48
in    _  _  _ . 14). We thus corrected the label-score mapping in the evaluation,
leading to a diferent base NDCG score for the  _ .
      </p>
      <p>Figure 2 shows the results for approaches combining both text and image. The performances
of ESCI_baseline, SBERT_text, and CLIP_text are visualized as dashed horizontal lines on the
plot, for reference. The points at w = 0.0 correspond to the performance of CLIP_image (with
a weight w of 0.0 for the text-based approach), and the points at w = 1.0 correspond to the
performance of the text-based approach (with a weight (1 − w) of 0.0 for the image-based
approach). By varying the value of w, the weight of A1, the results show that combining image
and text outperforms the approach using only text data.</p>
      <p>More specifically, combining CLIP_text and CLIP_image yields improvements of 2.41%
(score-based combination) and 2.1% (rank-based combination) over the text-only approach
(i.e., CLIP_text), while combining SBERT_text and CLIP_image yields improvements of 0.82%
(score-based) and 0.22% (rank-based).</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>This paper presents the Shopping Queries Image Dataset (SQID), building upon the Amazon
Shopping Queries Dataset and enriching it with image information for products. We present the dataset
and its characteristics, and provide experimental results showing the value of incorporating
image data for the task of product search. We hope that this data will support further research
around product search and ranking using multimodal data.</p>
      <p>SQID can be leveraged in the context of the ESCI benchmark, by evaluating the performance
of models using images on the ESCI test set. The data can also be used to fine-tune pretrained
models, outside of the ESCI benchmark. In addition, and as mentioned throughout the paper,
the availability of text together with images enables investigating different techniques around
multimodal learning relevant to the eCommerce space.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This dataset would not have been possible without the Shopping Queries Dataset by Amazon.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Valero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Subbian</surname>
          </string-name>
          ,
          <article-title>Shopping queries dataset: A large-scale ESCI benchmark for improving product search (</article-title>
          <year>2022</year>
          ). arXiv:
          <volume>2206</volume>
          .
          <fpage>06588</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>A winning solution of kdd cup 2022 esci challenge for improving product search (</article-title>
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>A semantic alignment system for multilingual query-product retrieval</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2208</volume>
          .
          <fpage>02958</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Zou, W. Zhang,
          <article-title>Second place solution of amazon kdd cup 2022: Esci challenge for improving product search</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bedrich</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Qu,</surname>
          </string-name>
          <article-title>Some practice for improving the search results of e-commerce</article-title>
          ,
          <source>arXiv preprint arXiv:2208.00108</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McGoldrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Al-Ghossein</surname>
          </string-name>
          , C.-W. Chen,
          <article-title>Captions are worth a thousand words: Enhancing product retrieval with pretrained image-to-text models</article-title>
          ,
          <source>Proceedings of the 3rd International Workshop on Interactive</source>
          and
          <article-title>Scalable Information Retrieval methods for E-Commerce (ISIR-eCom) (</article-title>
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. J. McAuley</surname>
          </string-name>
          ,
          <article-title>Bridging language and items for retrieval and recommendation</article-title>
          ,
          <source>CoRR abs/2403</source>
          .03952 (
          <year>2024</year>
          ). URL: https://doi.org/10.48550/arXiv.2403.03952. doi:
          <volume>10</volume>
          .48550/ARXIV.2403.03952. arXiv:
          <volume>2403</volume>
          .
          <fpage>03952</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kallumadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Magnani</surname>
          </string-name>
          ,
          <article-title>Overview of the trec 2023 product product search track</article-title>
          ,
          <source>arXiv preprint arXiv:2311.07861</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.org/CorpusID:231591445.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hendriksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bleeker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          , N. van
          <string-name>
            <surname>Noord</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Kuiper</surname>
          </string-name>
          , M. de Rijke,
          <article-title>Extending clip for category-to-image retrieval in e-commerce</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>289</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baldrati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bertini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Uricchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Bimbo</surname>
          </string-name>
          ,
          <article-title>Conditioned and composed image retrieval combining and partially fine-tuning clip-based features</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4959</fpage>
          -
          <lpage>4968</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>