<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cherag Aroraa</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tracy Holloway King</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jayant Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Lu</string-name>
          <email>yil@adobe.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanat Sharma</string-name>
          <email>sanatsha@adobe.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arvind Srikantan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Uvalle</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josep Valls-Vargas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harsha Vardhan</string-name>
        </contrib>
        <aff>Adobe Inc.
          <email>charora@adobe.com</email>
          <email>hmatadaallam@adobe.com</email>
          <email>jvallsvargas@adobe.com</email>
        </aff>
      </contrib-group>
      <abstract>
        <p>As user content and queries become increasingly multi-modal, the need for effective multi-modal search systems has grown. Traditional search systems often rely on textual and metadata annotations for indexed images, while multi-modal embeddings like CLIP enable direct search using text and image embeddings. However, embedding-based approaches face challenges in integrating contextual features such as user locale and recency. Building a scalable multi-modal search system requires fine-tuning several components. This paper presents a multi-modal search architecture and a series of AB tests that optimize embeddings and multi-modal technologies in Adobe Express template search. We address considerations such as embedding model selection, the roles of embeddings in matching and ranking, and the balance between dense and sparse embeddings. Our iterative approach demonstrates how utilizing sparse, dense, and contextual features enhances short and long query search, significantly reduces null rates (by over 70%), and increases click-through rates (CTR). Our findings provide insights into developing robust multi-modal search systems, thereby enhancing relevance for complex queries.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-modal search</kwd>
        <kwd>text-image embeddings</kwd>
        <kwd>hybrid search techniques</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        For search over images and multi-modal content, industry search systems traditionally rely on textual
and metadata annotations added to indexed images. However, multi-modal embeddings like CLIP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
enable direct search of image content using text and image embeddings, allowing for direct text-to-image
and image-to-image search. While pure embedding-based approaches facilitate content understanding,
they struggle with integrating contextual features like user locale and recency into retrieved results.
Building a production-grade, scalable, multi-modal search system involves carefully tuning several
components. This paper describes a series of AB tests conducted to leverage embeddings and other
multi-modal technologies in search for Adobe Express templates. These templates are complex
multi-modal (and multi-page) documents, containing images, text, and rich metadata (section 2). Figure 1
shows the Adobe Express template search for a head query and a tail query, where the templates are
displayed to the user as images and template metadata drives the left rail filters.
      </p>
      <p>To improve text search for templates, integrating embeddings required decisions as to:
• Which embedding model(s) to use
• Whether to leverage embeddings for matching (recall), ranking, or reranking
• Whether to use dense or sparse embeddings
• Whether head and tail queries should be treated identically
• Whether embeddings should be used for null and low recovery or everywhere
Other than which embeddings to use, these decisions were driven by latency concerns and by constraints
on integration with Elasticsearch, which was the existing inverted index used for Express template
search. With an ever-increasing collection of ∼300,000 templates, dense embeddings could not be used
for matching due to the number of scoring calculations, which leads to high latency. This restricted
dense embeddings to (re)ranking, where only a small (&lt;10K) number of top templates had to be scored,
and to scenarios like null and low recovery and long tail queries, where the additional latency was
worth the improved relevance. In addition, certain types of queries performed better with keyword
search, especially those around design type (e.g. poster, Instagram reel) and format (e.g. still, animated,
video).</p>
      <p>To determine the optimal combination, we took an iterative approach with a series of evaluations and
AB tests. We started with existing models with single integrations and then built on these to improve
remaining relevance issues. This paper first overviews the data and models used (section 2) and then
discusses the experiments and how the decisions were made for each of these (section 3).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Models and Data</title>
      <p>This section describes key data and models used for the Express template recall and ranking. The
templates themselves contain rich image, text, and metadata. The standard search behavioral data (e.g.
impressions, clicks) are available, as well as certain application-specific behavioral data (e.g. number of
edits, number of exports). These are briefly described in section 2.1. In addition, we have two types of
multi-modal models: two CLIP text-image models (section 2.2) and an intent-based model (section 2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. Template Data</title>
        <p>Express templates are rich objects which contain many visual layers and text boxes. These can also
be viewed as images, e.g. those that are displayed in search (Figure 1). In addition, templates have
titles provided by the template designers as well as filter information such as design type, style, mood,
region, and price (free/premium). Additional information is inferred about each template including
multi-modal embeddings, user intents, and image tags. Finally, aggregated behavioral data such as
impressions, clicks, edits (number of edits users make to the template in order to personalize it) and
exports (number of times the template is exported after editing) are available. An example template
with its data is shown in Figure 2.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Image-Text CLIP Embeddings</title>
        <p>
          CLIP [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] embeds images and text in the same space. This allows for embedding-based search of images
using text queries. There are several off-the-shelf CLIP models available. However, for Express template
search and for other visual asset search like Adobe Stock, we needed a model that: (1) worked on
short text (queries) as well as long text (captions); (2) covered five languages (English, French, German,
Japanese, Korean); (3) performed well on high-quality image data for templates, photographs and
illustrations; (4) had a sparse version as well as the dense vectors. To meet these requirements, we
trained a CLIP-architecture model on Adobe-licensed image-text data.</p>
        <p>[Figure 2: example template data. Title: Pink Unicorn Birthday Party Instagram Portrait Post; Topics: confetti, fantasy, glitter, gold, kids, sparkle, star, unicorn; Mood: happy, joyful; Style: bright; Region: all; Language: en-US; Date: 2023-12-12; Behavior: still; License: premium; AI-inferred: AdobeCLIP embedding, multi-modal CKG embedding, CKG symbolic intent, autotags; Behavioral: search impressions, clicks, edits, exports]</p>
        <p>The text model was particularly
important since the training focused on Adobe vocabulary, shorter text, and multiple languages; the
training architecture was inspired by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>There are many ways to improve the latency when using embeddings with large numbers of assets.
However, the approximate methods reduce accuracy because the list of assets whose embeddings are
closest to the query embedding is not exact. Once a smaller set of embeddings is selected (e.g. by using
the top n embeddings from the approximate scoring), then the dense embedding can be used to get
more accurate scores for a final ranking. We used a sparsification method which allows the embeddings
to be used similarly to keywords in the existing index. An example of this is shown in Table 1.</p>
        <p>The dense embeddings have values for every dimension (2048 dimensions in Table 1). The sparse
version, which is derived from the dense one, has more dimensions (8192 in Table 1) but most of them
have no values. For a query, only assets which match at least n dimensions are returned. In the example,
n is set to 1 and so image 2 matches dimension 3 and image 4 matches dimensions 3 and 8192, while
image 3 is not matched. The scoring is the sum of the matched dimension values weighted by the score
of that dimension for the query. This sparse encoding for matching and scoring is extremely fast, but
comes at the cost of lower accuracy.</p>
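        <p>The matching and scoring scheme above can be sketched as follows; the dimension indices, values, and minimum-match threshold are illustrative, not the production index configuration.</p>

```python
# Sketch of sparse-embedding matching and scoring; dimension indices and
# values are illustrative, not the production index. Each sparse vector
# maps an active dimension to its value.
def sparse_match_and_score(query_vec, asset_vecs, min_match=1):
    """Return assets matching at least min_match query dimensions, scored
    as the sum of matched asset values weighted by the query's values."""
    results = {}
    for asset_id, vec in asset_vecs.items():
        matched = set(query_vec).intersection(vec)
        if len(matched) >= min_match:
            results[asset_id] = sum(vec[d] * query_vec[d] for d in matched)
    return results

# Query activates dimensions 3 and 8192 (cf. Table 1).
query = {3: 0.9, 8192: 0.4}
assets = {
    "image2": {3: 0.7, 17: 0.2},      # matches dimension 3
    "image3": {5: 0.8},               # no overlap, filtered out
    "image4": {3: 0.5, 8192: 0.6},    # matches dimensions 3 and 8192
}
scores = sparse_match_and_score(query, assets, min_match=1)
```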
        <p>
          The Adobe-specific model (AdobeCLIP) was evaluated on Adobe Express and Stock content, using the
off-the-shelf CLIP versions as a baseline. (Other sparsification approaches are described in [
          <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
          ].) For large-scale evaluation, a stratified sample of held-out search
queries was used for semantic search against an index of CLIP and AdobeCLIP embeddings for Express
and Stock content. Past clicked assets were considered relevant, non-clicked items irrelevant. In
addition, we selected titles (generally 5–15 words) for Express templates and Stock images and used
them as long queries, measuring the position of the asset which originally had that title. These methods
provide a lower bound on performance since many of the non-clicked items are relevant and titles often
matched multiple high-ranking images. These two approaches allowed us to quickly compare different
versions of AdobeCLIP against one another and against CLIP. Once the AdobeCLIP model outperformed
CLIP and earlier AdobeCLIP versions, we manually inspected results for a subset of held-out queries.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multi-Modal Creative Knowledge Graph</title>
        <p>
          In addition to learning representations of the content via AdobeCLIP, we found mapping the content’s
intent to discrete nodes improved recall and explainability and allowed for downstream recommendation
tasks, similar to [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, we discovered that self-supervised models like AdobeCLIP, which were
trained on asset-caption and asset-query data from Adobe Stock and Adobe Express, failed to accurately
map the asset’s intent to short discrete labels. To accomplish this, we created a “Creative” Knowledge
Graph (CKG) [
          <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
          ] containing over 100K nodes focusing on Adobe-specific user intents. We then
trained a multi-modal transformer (MM-CKG) specializing in mapping assets to these discrete nodes
using supervised contrastive training. We mined concepts for events, actions, objects, moods, canvas
types, colors, and backgrounds to get a robust understanding of an asset’s content. For example, actions
has subtypes of run, dance, …; events has subtypes of birthday, graduation, wedding, seasonal, …; in turn
events∣seasonal has subtypes of Halloween, Thanksgiving, 4th of July, ….
        </p>
        <p>
          To train the model, we created sequence-wise self-attention blocks inspired by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We built our
model on top of a base CLIP backbone and added a sequence-wise attention block that takes in the
hidden states from the last layer of the CLIP backbone and runs them through a couple of layers of
multi-headed transformer blocks. We utilized the pooled outputs from the sequence-wise attention
heads as the final representations of the input image and text modalities.</p>
        <p>2.3.1. Supervised Contrastive Loss (SupCoLA)</p>
        <p>We devised our loss function with the following requirements:
1. Alignment to labels: Ensure that the image and text in the training process were close to the label
embeddings.
2. Ability to handle multiple positives in a batch: Traditional contrastive learning (InfoNCE loss
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]) assumes that for a given pair in a batch, all other pairs are negatives. However, when
learning alignment with labels, multiple rows with the same label may be present in a batch. The
loss function should not penalize these rows during loss computation.
3. Ability to have multiple labels per row: Some rows have multiple labels. For example, for the
prompt, boy is sitting on a beach with his dad for father’s day, there are multiple concepts: the
creative intent father’s day, the scene objects boy and beach, and the background beach background.
Our resulting loss function, Label-Aligned Supervised Contrastive Loss, is based on SupCon loss [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
where we pass image, text and label embeddings as anchor features as well as contrast features.
\[
\mathcal{L}^{\mathrm{sup}} = \sum_{i \in I} \mathcal{L}^{\mathrm{sup}}_{i}
= \sum_{i \in I} \Big( \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathbf{z}_i \cdot \mathbf{z}_p / \tau)}{\sum_{a \in A(i)} \exp(\mathbf{z}_i \cdot \mathbf{z}_a / \tau)} \Big) \tag{1}
\]</p>
        <p>where \(I\) is the mini batch, \(i\) is the index of the anchor sample in the batch, \(A(i) \equiv I \setminus \{i\}\) is the set of all samples
\(a\) in the batch that have a distinct index from the anchor \(i\), and \(P(i)\) is the set of all positives \(p\) in the batch
that have the same label as anchor \(i\) and are views of \(i\). Views of sample \(i\) denote the embeddings for
the label, image and text modalities.</p>
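        <p>A simplified NumPy sketch of the loss in equation (1): every row of z is treated as an anchor view, positives are the other rows sharing its label, and the denominator runs over all other rows. The batch values and temperature are illustrative, not the production training setup.</p>

```python
import numpy as np

def label_aligned_supcon(z, labels, tau=0.1):
    """Simplified sketch of the loss in equation (1). Assumes rows of z
    are L2-normalized view embeddings (label, image, text) and labels[i]
    gives the label id of row i."""
    n = z.shape[0]
    sim = z @ z.T / tau
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]          # A(i)
        positives = [p for p in others if labels[p] == labels[i]]  # P(i)
        if not positives:
            continue
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        total += -np.mean([sim[i, p] - log_denom for p in positives])
    return total

# Toy batch: three views (label, image, text) for two samples.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 8))
z = z / np.linalg.norm(z, axis=1, keepdims=True)
labels = [0, 0, 0, 1, 1, 1]
value = label_aligned_supcon(z, labels)
```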
        <p>Why do we have two domain-specific multi-modal embeddings (AdobeCLIP and MM-CKG)? These
target and excel at different use cases, both of which are important for template search relevance.
MM-CKG is better at determining the underlying key intent from a query and at specific scene object
detection. AdobeCLIP is better at color and layout understanding.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Iterative Experiments</title>
      <p>This section describes the series of online experiments conducted to improve the Express template
multi-modal search. We focus primarily on experiments involving multi-modal embeddings, but include
one experiment that leveraged multi-modal content under the hood but text at query time.</p>
      <p>
        The Express template search uses a standard architecture for relevance (Figure 4). It is built on
Elasticsearch. There is an initial matching (recall) step to retrieve documents which broadly match
the user’s query. This step uses keyword-style matching against text and metadata and includes an
initial low-latency scoring. Matching using sparse AdobeCLIP embeddings for all queries (section 3.4)
and dense multi-modal CKG embeddings for long queries (section 3.5) was also added. If not enough
results are found, null and low recovery occurs, including a speller (not discussed in this paper; see [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ])
and the use of symbolic CKG intents (section 3.2). The top 10K templates from the initial match set are
then reranked using a much broader set of features. This includes dense multi-modal embeddings as
well as the usual discrete features such as BM25, locale, language, and aggregated behavioral data.
      </p>
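        <p>The relevance flow just described can be sketched as the following orchestration; the stage functions are hypothetical stubs for the keyword match, sparse-embedding match, CKG recovery, and dense rerank stages, not the production Elasticsearch APIs.</p>

```python
# Hypothetical orchestration of the stages described above; the stage
# functions are illustrative stubs, not the production Elasticsearch APIs.
RERANK_TOP_K = 10_000       # dense features rerank only the top 10K matches
LOW_RESULT_THRESHOLD = 5    # below this, null/low recovery kicks in

def search(query, keyword_match, sparse_match, ckg_recovery, dense_rerank):
    # 1. Initial matching: keyword plus sparse AdobeCLIP matches.
    candidates = keyword_match(query) + sparse_match(query)
    # 2. Null and low recovery via symbolic CKG intents.
    if LOW_RESULT_THRESHOLD > len(candidates):
        candidates = candidates + ckg_recovery(query)
    # 3. Rerank the head of the candidate list with the dense feature set.
    head = dense_rerank(query, candidates[:RERANK_TOP_K])
    return head + candidates[RERANK_TOP_K:]

results = search(
    "birthday card",
    keyword_match=lambda q: ["t1", "t2"],
    sparse_match=lambda q: ["t3"],
    ckg_recovery=lambda q: ["t4"],
    dense_rerank=lambda q, docs: sorted(docs),
)
```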
      <p>To improve relevance, and hence search click-through rate and export rate, we took an iterative
approach to learn how to optimally leverage our multi-modal understanding, especially the multi-modal
embeddings. Each of these experiments is discussed in detail below. Since we were redesigning the search system,
including the retrieval and ranking platform, some of the experiences were evaluated extensively offline
and then launched with end-to-end monitoring, while later experiences were AB tested.</p>
      <sec id="sec-3-1">
        <title>3.1. Reranking with External Image-Text Model</title>
        <p>Our initial experiment used an external, English-only CLIP model in the rescore stage of the ranker.
To do this, we had to determine how many items to rescore. This was largely governed by latency
concerns since we wanted as many items as possible to use the CLIP multi-modal signal. By running
load tests, we determined that we could use the CLIP scores for the top 10K templates, where the initial
ranking was determined by the existing ranker.</p>
        <p>We also had to determine how to weight the CLIP scores in the rescoring. Since the template ranker
at this time was a non-ML, hand-tuned ranker, we determined this based on evaluations of a stratified
query sample. The extreme baseline was to use only the CLIP score for the reranking. This had two
drawbacks: (1) The top results were not visually diverse enough, especially for broad queries like
birthday card or wedding invitation; (2) there was not enough recent content to provide a sense of
freshness and seasonality. To determine a suitable weight, we used a divide-and-conquer approach,
starting with a 50/50 balance between the first-round ranker score and the CLIP score and then adjusting.
This quickly converged on a weighting of roughly 2/3 for CLIP and 1/3 for the first-round ranker score.
In AB testing, the click-through rate (CTR) and export rate improved with the CLIP-based reranking
(Table 2). There was no change in the null rate, as was to be expected.</p>
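        <p>The converged rescoring can be sketched as follows; the template identifiers and scores are illustrative, and both scores are assumed to be on comparable scales.</p>

```python
# Sketch of the rescore stage: blend the first-round ranker score with the
# CLIP text-image similarity at the converged weighting. Identifiers and
# scores are illustrative.
CLIP_WEIGHT = 2 / 3  # ~2/3 CLIP, ~1/3 first-round ranker

def rescore(candidates):
    """candidates: list of (template_id, first_round_score, clip_score)."""
    blended = [
        (tid, CLIP_WEIGHT * clip + (1 - CLIP_WEIGHT) * first)
        for tid, first, clip in candidates
    ]
    return sorted(blended, key=lambda pair: pair[1], reverse=True)

top = rescore([("t1", 0.9, 0.2), ("t2", 0.4, 0.8), ("t3", 0.6, 0.6)])
```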
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Null and Low Recovery with Symbolic Multi-Modal Intents</title>
        <p>Due to the broad range of user intents, the limited template collection, and the keyword-based retrieval,
users frequently landed on null and low result pages. When the number of results is low (&lt;5 results),
engagement with the search results drops significantly (up to 2–3x). To reduce the null and low
result rate, we incorporated a recovery mechanism using the symbolic CKG intents. The CKG intents
for each template were indexed and the CKG intents for the query were calculated at query time. If
there were &lt;5 results, the CKG query intents were matched against the template intents. For example,
the query hot yoga studio opening has the intent yoga and would match all templates with that intent.
This resulted in major improvements in CTR and null rate (Table 3).</p>
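        <p>The recovery logic can be sketched as follows; the low-result threshold of 5 follows the description above, while template identifiers and intent sets are illustrative.</p>

```python
# Sketch of null/low recovery: when the initial match set is small, add
# templates that share a symbolic CKG intent with the query. Identifiers
# and intent sets are illustrative.
LOW_RESULT_THRESHOLD = 5

def recover(query_intents, initial_results, template_intents):
    if len(initial_results) >= LOW_RESULT_THRESHOLD:
        return initial_results
    recovered = [
        tid for tid, intents in template_intents.items()
        if tid not in initial_results and query_intents.intersection(intents)
    ]
    return initial_results + recovered

# The query "hot yoga studio opening" carries the CKG intent "yoga".
templates = {"t1": {"yoga", "fitness"}, "t2": {"wedding"}, "t3": {"yoga"}}
results = recover({"yoga"}, [], templates)
```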
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ranking with Domain-specific Image-Text Model</title>
        <p>The CLIP model (section 3.1) only worked for English and was not optimized for Express templates
and queries. Replacing CLIP with AdobeCLIP to rerank the top 10K Express templates was expected
to be on par for English queries and to improve the CTR for non-English queries. Because the recall
and first-round ranker constrained the result set, the core relevance, especially for head queries, was
unlikely to change significantly, although the torso and tail queries, especially in non-English, were
expected to be significantly different. The move from CLIP to AdobeCLIP was part of a larger AB test
which moved from an older search infrastructure to a newer one which, among other things, allowed
for multiple embedding types. The goal of the AB test was to have no negative effects while moving to
the new platform. This was borne out (Table 4). (The template collection is continually expanding, so
the maximum recall size varied for each experiment. All results are shown as percentage change; we
are unable to show exact metrics.)</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Recall with Sparse Image-Text Model</title>
        <p>
          None of the previous experiments leveraged the power of embeddings for augmenting the initial match
set. The CLIP and AdobeCLIP models only affected the reranking of the search results. The null
and low recovery with symbolic CKG multi-modal intents only affected null and low queries and
leveraged symbolic intents. Dense embeddings could not be used for the initial match set due to latency
constraints. So, we experimented with using the AdobeCLIP sparse embeddings in the match set to
augment the existing keyword matches. This required determining how many dimensions to match in
the sparse embedding (section 2.2) in order to retrieve enough new relevant documents and not too
many irrelevant ones. As is well known in the literature [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], determining accurate thresholds on
embeddings using cosine similarity or dot product is not feasible. The sparse embedding approach
allowed us to require a minimum number of asset dimension matches. Once the retrieval approach
was determined, the ranking was updated to demote less relevant templates retrieved by the sparse
embeddings (Table 5).</p>
        <p>We considered using only AdobeCLIP sparse embeddings for the match set but rejected this since the
model performed badly at identifying videos, which are popular queries and important to the business:
AdobeCLIP only has the image embedding to match against, and the images from Express video templates
are extremely similar to those of still templates. See [16] for a recent approach to determining relevance
thresholds with embeddings.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Long Query Recall and Ranking with Multi-Modal Model</title>
        <p>The above experiments improved relevancy for head queries, both in matching the intent of the user
query and in quality of the templates shown. In addition, the improved recall from CKG symbolic
intents for null and low recovery (section 3.2) and the addition of sparse AdobeCLIP embeddings into
the initial match set (section 3.4) resulted in a broad set of related templates being shown when there
are few exact matches. However, there are often few exact match templates for more specific user
queries, i.e. for tail queries.</p>
        <p>To address this issue, we targeted longer queries (&gt;=4 words) to use the CKG multi-modal embedding
(MM-CKG, section 2.3). The more specific intents of the longer queries work especially well with the
domain-specific embeddings, allowing the recall and ranking to find the few templates that exactly
match the user intent. The ranking combined 1/3 the weight on MM-CKG and 2/3 the weight on
AdobeCLIP. The hypothesis behind this was that AdobeCLIP captures the core relevance matching the
query text to the image rendition, while MM-CKG captures the underlying intent of the query and
the template. The optimal query length for this experience was determined empirically by manually
judging a stratified sample of queries of different lengths, comparing production to the new experience.
There was a clear demarcation between queries of &lt;4 words and those of &gt;=4 words. Table 6 shows
that for &lt;4 words, both production and MM-CKG largely provide relevant results, i.e. for head queries
both approaches work well. However, for &gt;=4 words the new MM-CKG results are significantly better
than those in production.</p>
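        <p>The routing and blending described above can be sketched as follows; the similarity inputs are illustrative stand-ins for the MM-CKG and AdobeCLIP cosine similarities.</p>

```python
# Sketch of the long-query path: queries of at least 4 words are ranked
# with a 1/3 MM-CKG + 2/3 AdobeCLIP blend. The similarity values are
# illustrative stubs for the two embedding models.
LONG_QUERY_MIN_WORDS = 4
MMCKG_WEIGHT = 1 / 3  # 1/3 MM-CKG, 2/3 AdobeCLIP

def is_long_query(query):
    return len(query.split()) >= LONG_QUERY_MIN_WORDS

def blended_score(mmckg_sim, adobeclip_sim):
    return MMCKG_WEIGHT * mmckg_sim + (1 - MMCKG_WEIGHT) * adobeclip_sim

long_q = is_long_query("pink unicorn birthday party invitation")  # 5 words
short_q = is_long_query("birthday card")                          # 2 words
score = blended_score(0.9, 0.6)
```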
        <p>The MM-CKG embedding does not yet have a sparse version of the type available for AdobeCLIP
(section 2.2). For this reason, it was not feasible from a latency perspective to use the MM-CKG matching
and ranking approach for all queries. However, since long queries are &lt;10% of the query traffic,
we determined that the increased latency from calculating cosine similarity scores between the query
and all of the templates was worth the improvements to relevance.</p>
        <p>The AB test showed statistically significant improvements in CTR and null rate on long
queries and prompts, highlighting the usefulness of the hybrid system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Multi-modal search experiences in industry applications traditionally depend on textual data in the
index, thereby reducing the multi-modal search to a traditional keyword search. This provides a low
latency experience since industry search engines are heavily optimized for keyword search. The advent
of high quality multi-modal embeddings like CLIP has provided radically new capabilities. However, in
an existing application, such as the Adobe Express template search in this paper, the available multi-modal
capabilities and the existing infrastructure, including strict latency requirements, require a thoughtful,
iterative approach to integrating new multi-modal technologies. This paper described five multi-modal
experiments in Express template search, each of which built upon the others. This has resulted in
significantly lower null and low rates, while improving click-through rates.</p>
      <p>[16] N. Rossia, J. Lin, F. Liu, Z. Yang, T. Lee, A. Magnani, C. Liao, Relevance filtering for embedding-based
retrieval, in: Proceedings of CIKM 2024, ACM, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <year>2021</year>
          . ArXiv:2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Carlsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Eisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rekathati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          ,
          <article-title>Cross-lingual and multilingual CLIP</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Calzolari</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Béchet</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Blache</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Choukri</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Cieri</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Declerck</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Goggi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Isahara</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Maegaard</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Mariani</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Mazo</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Odijk</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Piperidis</surname></string-name> (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>6848</fpage>
          -
          <lpage>6854</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.739.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Dudek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <article-title>Sparseembed: Learning sparse lexical representations with contextual embeddings for retrieval</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Product quantization for nearest neighbor search</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>33</volume>
          (
          <year>2011</year>
          )
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kusupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rege</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wallingford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramanujan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Howard-Snyder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kakade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>Matryoshka representation learning</article-title>
          ,
          <year>2024</year>
          . arXiv:2205.13147.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Abel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gouspillou</surname>
          </string-name>
          ,
          <article-title>Combining embedding-based and semantic-based models for post-hoc explanations in recommender systems</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.04474. arXiv:2401.04474.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deshmukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sadaphule</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stefan</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: https://patents.google.com/patent/US11645095B2/en, United States Patent and Trademark Office, number US11645095B2.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <article-title>Multimodal input contextual font</article-title>
          ,
          <year>2023</year>
          . URL: https://patents.google.com/patent/US11775734B2/en, United States Patent and Trademark Office, number US11775734B2.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <article-title>Contextual font recommendations based on user intent</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.08188. arXiv:2306.08188, presented at ECOM23, the SIGIR Workshop on e-Commerce Search.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>CMA-CLIP: Cross-modality attention CLIP for image-text classification</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2112.03562. arXiv:2112.03562.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>van den Oord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <article-title>Representation learning with contrastive predictive coding</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1807.03748. arXiv:1807.03748.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Teterwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maschinot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <article-title>Supervised contrastive learning</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2004.11362. arXiv:2004.11362.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Valls-Vargas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guerin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <article-title>Contextual multilingual spellchecker for user queries</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>3395</fpage>
          -
          <lpage>3399</lpage>
          . URL: https://doi.org/10.1145/3539618.3591861. doi:10.1145/3539618.3591861.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ethayarajh</surname>
          </string-name>
          ,
          <article-title>How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1909.00512. arXiv:1909.00512.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Moubayed</surname>
          </string-name>
          ,
          <article-title>Length is a curse and a blessing for document-level semantics</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.16193. arXiv:2310.16193.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>