<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Caspari</string-name>
          <email>laura.caspari@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kanishka Ghosh Dastidar</string-name>
          <email>kanishka.ghoshdastidar@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saber Zerhoudi</string-name>
          <email>saber.zerhoudi@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jelena Mitrovic</string-name>
          <email>jelena.mitrovic@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <email>michael.granitzer@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: we use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmarking Information Retrieval (BEIR) benchmark. Through our experiments we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to the OpenAI models.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language model</kwd>
        <kwd>Retrieval-augmented generation</kwd>
        <kwd>Model similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>
        Retrieval-Augmented Generation (RAG) is an emerging
paradigm that helps mitigate the problems of factual
hallucination [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and outdated training data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of large language
models (LLMs) by providing these models with access to
an external, non-parametric knowledge source (e.g. a
document corpus). Central to the functioning of RAG
frameworks is the retrieval step, wherein a small subset of
candidate documents is retrieved from the document corpus,
specific to the input query or prompt. This retrieval
process, known as dense retrieval, hinges on text embeddings.
Typically, the generation of these embeddings is assigned
to an LLM, for which there are several options due to the
rapid evolution of the field. Consequently, selecting the
most suitable embedding model from an array of available
choices emerges as a critical aspect in the development of
RAG systems. The information to guide this choice is
currently primarily limited to architectural details (which are
also on occasion scarce due to the prevalence of closed
models) and performance benchmarks such as the Massive Text
Embedding Benchmark (MTEB) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        We posit that an analysis of the similarity of the
embeddings generated by these models would significantly aid
this model selection process. Given the large number of
candidates and the ever-increasing scale of the models, a
from-scratch empirical evaluation of the embedding quality of
these LLMs on a particular task can incur significant costs.
This challenge becomes especially pronounced when
dealing with large-scale corpora comprising potentially millions
of documents. While the relative performance scores of
these models on benchmark datasets offer the simplified
perspective of comparing a single scalar value on an
array of downstream tasks, such a view of model similarity
might overlook the nuances of the relative behaviour of
the models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As an example, the absolute difference in
precision@k between two retrieval systems only provides a
weak indication of the overlap of retrieved results. We argue
that identifying clusters of models with similar behaviour
would allow practitioners to construct smaller, yet diverse
candidate pools of models to evaluate. Beyond model
selection, as highlighted by Klabunde et al., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], such an analysis
also facilitates the identification of common factors
contributing to strong performance, easier model ensembling,
and detection of potential instances of unauthorized model
reuse.
      </p>
      <p>In this paper, we analyze different LLMs in terms of the
similarities of the embeddings they generate. Our similarity
analysis serves as an unsupervised evaluation framework
for these embedding models, in contrast to performance
benchmarks that require labelled data. We do this from a
dual perspective: we directly compare the embeddings
using representational similarity measures. Additionally, we
evaluate model similarity specifically in terms of their
functional impact on RAG systems, i.e. we look at how similar
the retrieved results are. Our evaluation focuses on
several prominent model families, to analyze similarities both
within and across them. We also compare proprietary
models (such as those by OpenAI or Cohere) to open-sourced
ones in order to identify the most similar alternatives. Our
experiments are carried out on five popular benchmark
datasets to determine if similarities between models are
influenced by the choice of data. Our code is available at
https://github.com/casparil/embedding-model-similarity.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Studies evaluating similarities of neural networks fall into
two main categories: the first involves comparing
activations of different models generated at any pair of layers for a
specific input (representational similarity), while the second
compares the model outputs (functional similarity). Raghu
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Morcos et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] propose measures building
on Canonical Correlation Analysis (CCA) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a statistical
technique used to find the linear relationship between two
sets of variables by maximizing their correlation. Such
comparisons using CCA or variants thereof can be found in
several works [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Beyond CCA-based measures,
other works have also explored computing correlations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
and the mutual information [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] between neurons across
networks. Kornblith et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] propose Centered Kernel
Alignment (CKA), which they show improves over several
similarity measures in identifying corresponding layers of
identical networks with different initializations. A diverse
range of functional similarity evaluations have also been
explored in the literature. A few examples include
model stitching [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], disagreement measures between
output classes [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and quantifying the similarity
between the class-wise output probabilities [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We refer
the reader to the survey by Klabunde et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for a
detailed overview of representational and functional similarity
measures.
      </p>
      <p>
        Recently, a few works have also focused on specifically
evaluating the similarity of LLMs. While Wu et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
evaluate language models along several perspectives, such
as their representational and neuron-level similarities, their
evaluation pre-dates the introduction of the recent wave
of large scale models. Freestone and Santu [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] consider
similarities of word embeddings, and evaluate if LLMs
differ significantly from classical encoding models in terms of
their representations. The works by Klabunde et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and Brown et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] are more recent, and evaluate the
representational similarity of LLMs, with the latter also
considering the similarities between models of different sizes
in the same model family.
      </p>
      <p>
        Much of the literature on evaluation of LLM embeddings
focuses on their performance on downstream tasks, with
benchmarks such as BEIR [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] (for retrieval specifically) and
MTEB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] providing a unified view of embedding quality
across metrics and datasets. The metrics used here mostly
include typical information retrieval metrics such as
precision, recall, and mean reciprocal rank at certain cutoffs.
Some works specifically evaluate the retrieval components
in a RAG context, where they either use a dataset outside
of those included in the benchmarks [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] or where the
evaluation encompasses other aspects of the retriever beyond
the embedding model being used [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Another approach,
that does not rely on ground-truth labels, is given by the
Retrieval Augmented Generation Assessment (RAGAS)
framework, which uses an LLM to determine the ratio of sentences
in the retrieved context that are relevant to the answer
being generated [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. To the best of our knowledge, there are
no works that evaluate the similarity of embedding models
from a retrieval perspective.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We evaluate embedding model similarity using two
approaches. The first directly compares the embeddings of text
chunks generated by the models. The second approach is
specific to the RAG context, where we evaluate the
similarity of retrieved results for a given query. These approaches
are discussed in detail in the following sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Pair-wise Embedding Similarity</title>
        <p>
          There are several metrics defined in the literature that
measure representational similarity [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Many of these metrics
require the representation spaces of the embeddings being
compared to be aligned and/or the dimensionality of the
embeddings to be identical across models. To avoid these
constraints, we pick Centered Kernel Alignment (CKA) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
with a linear kernel as our similarity measure.
        </p>
        <p>
          The measure computes similarity between two sets of
embeddings in two steps. First, for a set of embeddings,
the pair-wise similarity scores between all entries within
this set are computed using the kernel function. Thus, row
k of the resulting similarity matrix contains entries
representing the similarity between embedding k and all other
embeddings, including itself. Computing two such
embedding similarity matrices for different models with the same
number of embeddings then leads to two matrices E and
E' of matching dimensions. These are compared directly
in the second step with the Hilbert-Schmidt Independence
Criterion (HSIC) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] using the following formula:
$$\mathrm{CKA}(E, E') = \frac{\mathrm{HSIC}(E, E')}{\sqrt{\mathrm{HSIC}(E, E)\,\mathrm{HSIC}(E', E')}}$$
        </p>
        <p>
          The resulting similarity scores are bounded in the interval
[0, 1], with a score of 1 indicating equivalent representations.
CKA assumes that representations are mean-centered.
        </p>
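        <p>
          To make this computation concrete, the following sketch (our own illustration, not taken from the paper's released code) computes linear CKA for two matched embedding matrices with one row per text chunk; the embedding dimensions of the two models may differ:
        </p>
        <preformat>
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between embedding matrices of shape (n_chunks, dim).

    Only the number of rows (matched text chunks) must agree; the
    embedding dimensions of the two models may differ.
    """
    # Mean-center each representation, as CKA assumes.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # With a linear kernel, HSIC reduces to squared Frobenius norms
    # of the (cross-)covariance matrices, so the Gram matrices never
    # have to be materialized.
    hsic_xy = np.linalg.norm(y.T @ x, ord="fro") ** 2
    hsic_xx = np.linalg.norm(x.T @ x, ord="fro") ** 2
    hsic_yy = np.linalg.norm(y.T @ y, ord="fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
        </preformat>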
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval Similarity</title>
        <p>While a pair-wise comparison of embeddings offers insights
into the similarities of the representations learned by these
models, it does not suffice to quantify the similarities in
outcomes when these embedding models are deployed for
specific tasks. Therefore, in the context of RAG systems, we
consider the similarity of retrieved text chunks for a given
query when different embedding models are used. As a
first step, for a given dataset, we generate embeddings of
queries and document chunks with each of the embedding
models. We then retrieve the k most similar embeddings
in terms of cosine similarity for a particular query. As
these embeddings correspond to specific chunks of text, we
derive the sets of retrieved chunks C and C' for a pair of
models. To measure the similarity of these sets, we use the
Jaccard similarity coefficient as follows:
$$J(C, C') = \frac{|C \cap C'|}{|C \cup C'|}$$</p>
        <p>
          Here, |C ∩ C'| corresponds to the overlap in text chunks,
counting how often the two models retrieved the same
chunks. Similarly, the union |C ∪ C'| corresponds to all
retrieved text chunks, counting chunks present in both sets
only once. The resulting score is bounded in the interval
[0, 1], with 1 indicating that both models retrieved the same
set of text chunks.
        </p>
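        <p>
          A minimal sketch of this computation over sets of retrieved chunk ids (our own illustration; the released code may implement it differently):
        </p>
        <preformat>
def jaccard_similarity(c: set, c_prime: set) -> float:
    """Jaccard similarity between two sets of retrieved chunk ids."""
    if not c and not c_prime:
        return 1.0
    # Overlap divided by the union; chunks in both sets count once.
    return len(c.intersection(c_prime)) / len(c.union(c_prime))

# Example with top-3 results of two models, identified by chunk id:
# jaccard_similarity({"d1-c0", "d1-c3", "d2-c1"},
#                    {"d1-c0", "d2-c1", "d4-c2"})  -> 0.5
        </preformat>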
        <p>
          While Jaccard similarity measures the degree to
which two sets overlap, it ignores the order of their elements. Rank
similarity [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], on the other hand, considers the order of
common elements, with elements ranked closer to the top having a higher
impact on the score. The measure assigns ranks to common
text chunks according to their similarity to the query, i.e.
r(c) = k if chunk c was the k-th retrieved result for the
query. Ranks are then compared using:
$$s(r(c), r'(c)) = \frac{2}{(1 + |r(c) - r'(c)|)\,(r(c) + r'(c))}$$
        </p>
        <p>With this, rank similarity for two sets of retrieved text
chunks C, C' is calculated as:</p>
        <p>
          $$\mathrm{RankSim}(C, C') = \frac{1}{H(|C \cap C'|)} \sum_{c \in C \cap C'} s(r(c), r'(c))$$
with $H(K) = \sum_{k=1}^{K} \frac{1}{k}$ denoting the K-th
harmonic number (here K = |C ∩ C'|), normalizing the score. Like the other
measures, rank similarity is bounded in the interval [0, 1],
with 1 indicating that all ranks are identical.
        </p>
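        <p>
          A sketch of the full computation (our own illustration), assuming each model's retrieval is given as a mapping from chunk id to its 1-based rank:
        </p>
        <preformat>
def rank_similarity(ranks: dict, ranks_prime: dict) -> float:
    """Rank similarity between two top-k retrievals.

    `ranks` and `ranks_prime` map chunk ids to 1-based retrieval ranks.
    """
    common = set(ranks).intersection(ranks_prime)
    if not common:
        return 0.0
    # Per-chunk score: matches that are highly ranked and close in
    # rank contribute the most.
    total = sum(
        2.0 / ((1 + abs(ranks[c] - ranks_prime[c])) * (ranks[c] + ranks_prime[c]))
        for c in common
    )
    # Normalize by the K-th harmonic number (K = overlap size), so
    # two identical rankings score exactly 1.
    harmonic = sum(1.0 / k for k in range(1, len(common) + 1))
    return total / harmonic
        </preformat>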
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>The following paragraphs describe our choice of datasets
and models, along with details of the implementation of our
experiments.</p>
      <p>
        As we focus on the retrieval component of RAG
systems, we select five publicly available datasets from the
BEIR benchmark [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. As generating embeddings for large
datasets is a time-intensive process, especially for a larger
number of models, we opt for five of the smaller datasets
from the benchmark. This approach allows us to compare
embeddings generated by a variety of models while at the
same time allowing us to evaluate embedding similarity
across datasets. An overview of the datasets is shown in Table
1. For each dataset, we create embeddings by splitting
documents into text chunks such that each chunk contains 256
tokens. The embedding vectors are stored with Chroma DB
[
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], an open source embedding database. For each vector,
we additionally store information about the document and
text chunk ids it encodes to be able to match embeddings
generated by different models for evaluation.
      </p>
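      <p>
        As an illustration of this setup, the following sketch stores chunk embeddings in Chroma DB together with the document and chunk ids used for matching. The model and collection names are placeholders, and the chunk list is simplified relative to the 256-token splitting described above:
      </p>
      <preformat>
import chromadb
from sentence_transformers import SentenceTransformer

# Placeholder model; the study compares 19 embedding models.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Assume `chunks` holds 256-token text chunks, with parallel lists
# of document and chunk ids produced during splitting.
chunks = ["first chunk of document 0 ...", "second chunk ..."]
doc_ids, chunk_ids = [0, 0], [0, 1]

client = chromadb.Client()
collection = client.create_collection(name="bge-large-nfcorpus")
collection.add(
    ids=[f"{d}-{c}" for d, c in zip(doc_ids, chunk_ids)],
    embeddings=model.encode(chunks).tolist(),
    # Document and chunk ids let us match embeddings of the same
    # text across different models later.
    metadatas=[{"doc_id": d, "chunk_id": c}
               for d, c in zip(doc_ids, chunk_ids)],
)
      </preformat>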
      <p>
        For model selection, we primarily use publicly available
models from the MTEB leaderboard [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We do not simply
pick the best performing models on the leaderboard; instead,
our choices are influenced by several factors. Firstly, we
focus on analyzing similarities within and across model
families and pick models belonging to the e5 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], t5 [
        <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
        ],
bge [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and gte [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] families. Secondly, we recognize
that it might be of interest to users to avoid pay-by-token
policies of proprietary models by identifying similar
open-source alternatives. Therefore, we pick high-performing
proprietary models, two from OpenAI
(text-embedding-3-large and -small) [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] and one from Cohere (Cohere
embed-english-v3.0) [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. We also compare the
mxbai-embed-large-v1 (mxbai) [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] and UAE-Large-V1 (UAE) [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] models, which
not only report very similar performance on MTEB, but also
identical embedding dimensions, model size, and memory
usage. Finally, we include SFR-Embedding-Mistral (Mistral)
[
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] as the best-performing model on the leaderboard at the
time of our experiments. A detailed overview of all selected
models can be seen in Table 2.
      </p>
      <p>
        To compare embedding similarity across models and
datasets, we employ different strategies depending on the
similarity measure. We apply CKA by retrieving all
embeddings created by a model, matching embeddings using
their document and text chunk ids and then computing
their similarity for each of the five datasets. For Jaccard
and rank similarity, we use sklearn’s NearestNeighbor class
[
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] to determine the top-k retrieval results. We
compute Jaccard and rank scores per dataset, averaging over 25
queries. For the NFCorpus dataset, we calculate retrieval
similarity for all possible k, i.e. using all embeddings
generated for the dataset. As calculating similarity for each
possible k is computationally expensive, we did not repeat
this for the remaining datasets and chose a smaller k value
instead. Furthermore, as only a limited number of results
are to be provided as context to the generative model,
analyzing retrieval similarity at low k values, e.g. top-10, is
of most interest. As we are interested in identifying clusters
of similar models, we also perform a hierarchical clustering
on heatmap values using Seaborn [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]. The following
section describes the results of our evaluation for the different
measures.
      </p>
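      <p>
        A sketch of the retrieval step (our own illustration): for each model, the top-k chunk indices per query are obtained with sklearn's NearestNeighbors under cosine distance, and the resulting id sets feed the Jaccard and rank computations of Section 3:
      </p>
      <preformat>
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_k_indices(doc_embs: np.ndarray, query_embs: np.ndarray,
                  k: int) -> np.ndarray:
    """Indices of the k most cosine-similar chunks for each query."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(doc_embs)
    _, indices = nn.kneighbors(query_embs)
    return indices  # shape: (n_queries, k)

# Average top-10 Jaccard similarity between two models over queries,
# assuming both embedding matrices index chunks in the same order.
# jaccard_similarity is the sketch from Section 3.2.
def mean_jaccard(embs_a, embs_b, queries_a, queries_b, k=10) -> float:
    top_a = top_k_indices(embs_a, queries_a, k)
    top_b = top_k_indices(embs_b, queries_b, k)
    return float(np.mean([jaccard_similarity(set(a), set(b))
                          for a, b in zip(top_a, top_b)]))
      </preformat>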
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate how similar embeddings generated by different
models are, we will first consider model families, checking if
their pairwise and top-k similarity scores are highest within
their family. Subsequently, we will identify the open source
models which are most similar to our chosen proprietary
models.</p>
      <sec id="sec-5-1">
        <title>5.1. Intra- and Inter-Family Clusters</title>
        <p>Comparing embeddings directly with CKA shows high
similarity across most of the models, albeit with some variance.
These scores allow us to identify certain clusters of models.
Figure 1 shows the pair-wise CKA scores of all models
averaged across the five datasets. As expected, scores for most
models are highest within their own family. This holds true
for the gtr-t5, sentence-t5 and text-embedding-3 (OpenAI)
models. Although the sentence-t5 and gtr-t5 models are
closely related, they do not exhibit significantly higher
similarity with each other compared to the remaining models.</p>
        <p>From an inter-family perspective, we observe high
similarity between the bge and gte models. For some models
in these two families, interestingly, the highest similarity
scores correspond to inter-family counterparts with
matching embedding dimensions rather than to models in the
same family. Specifically, gte-small reports the highest
similarity to bge-small and gte-base to bge-base. On the other
hand, gte-large shows slightly higher similarity to bge-base
than to bge-large, and thus to a model with a lower embedding
dimension. Another inter-family cluster is formed by the
three models with the highest CKA scores overall, namely
UAE, mxbai and bge-large, whose scores suggest almost
perfect embedding similarity. In fact, the similarity score of
bge-large to these two models is much higher than to other
bge models.</p>
        <p>Shifting our attention to top-k retrieval similarity, clusters
vary depending on the k value. Figure 3 illustrates how
Jaccard similarity evolves over k on NFCorpus. The first
plot displays Jaccard scores between bge-large and all other
models, while the second plot illustrates the scores for
gte-large. For extremely low k, we observe some peaks for
nearly all models, followed by a noticeable drop in similarity.
Of course, for larger k, the scores converge to one.
Reaffirming our earlier observations with the CKA metric,
bge-large demonstrates high retrieval similarity with UAE
and mxbai. Similarity to the remaining models is much
lower, with the highest scores for bge-base and bge-small
for larger k. However, especially for small k, there is high
variance in similarity scores, with models from other families,
e.g. Mistral or gte-large, sometimes achieving higher scores
than the bge models. A similar pattern can also be observed
in the second plot, where Jaccard similarity for gte-large
is highest within its family for larger k, but models like
mxbai or bge-base sometimes report higher similarity
for small k. Therefore, the clusters we identified through
our CKA analysis are only truly reflected in these plots for
large values of k. This suggests that in real-world use cases,
where the top-k results are crucial, such representational similarity
measures might not provide the full picture. The plots for
other model families provide nearly identical insights as
those in the second plot in Figure 3, and thus we do not
present them for the sake of brevity.</p>
        <p>For rank similarity, scores peak for small k and then
quickly start to drop until they approach a low, stable score
for larger k, as shown in Figure 2 for gte-large. Once again,
the bge/UAE/mxbai inter-family cluster shows the highest
similarity. In contrast to Jaccard similarity, the clusters that
could be observed for CKA do not always show for rank
similarity. As can be seen in Figure 2, the model with the
highest rank similarity to gte-large is mxbai, rather than
another gte model. Even so, the previously observed
clusters also tend to appear for rank similarity, though they
vary more depending on the models and dataset.
Generally, scores for nearly all models are rather small for larger
k, indicating low rank similarity. For small k, results vary
more and differences between individual models are more
pronounced.</p>
        <p>As retrieval similarity at small k is of most interest from a
practical perspective, we take a closer look at top-10 Jaccard
similarity. The heatmaps in Figures 4-6 show the top-10
Jaccard similarity between models across datasets. A striking
insight here is that even the most similar models only report
a Jaccard similarity of just above 0.6, with the majority
below 0.5. This is of great relevance to practitioners, as
it implies that even models that report high representational
similarity scores may yield little overlap in retrieved text
chunks. As earlier, the cluster of UAE/mxbai/bge-large is
the most prominent one, with the highest scores.
Intra-family scores tend to be the highest for most models,
e.g. for the t5 and OpenAI families. Depending on the
dataset, this also applies to the gte and e5 models, although
Jaccard similarity to models from other families is
sometimes higher. We also note that for the two larger datasets,
FiQA-2018 and TREC-COVID, the similarity scores are
generally substantially lower, as can be seen in Figure 6. For
the smaller datasets, Jaccard similarity is generally higher,
though results differ depending on the data (see Figures 4
and 5). Similar observations can be made for rank similarity,
although the appearance of family clusters is more
dependent on the dataset. Larger datasets also lead to lower scores.
These results illustrate that while family clusters can still
be perceived at small k, they are not as prominent as they
are for larger k. Furthermore, the top-10 retrieved results
differ substantially for most models and datasets, and their
similarity might depend on the dataset itself.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Open Source Alternatives to Proprietary Models</title>
        <p>We explicitly included proprietary models in our analysis to
check which open source models are the best, which in our
case means the most similar, alternatives. The CKA scores
in Figure 1 indicate that embeddings generated by OpenAI's
models (text-embedding-3-large/-small) are highly similar to
those generated by Mistral, while the Cohere model
(embed-english-v3.0) demonstrates high similarity to e5-large-v2.</p>
        <p>These observations do not entirely extend to retrieval
similarity, especially for Cohere. While Mistral is still the most
similar model to OpenAI's for larger k across all datasets,
there is no consistently most similar model for Cohere.
Rather, the model varies depending on the dataset and
measure (Jaccard or rank similarity), sometimes showing the
highest similarity to e5-large-v2, but sometimes also to other
models like Mistral. Taking a closer look at top-10
similarity, Mistral still largely exhibits the highest similarity to the
OpenAI models, especially to text-embedding-3-large. For
text-embedding-3-small, scores on all datasets are rather
close to those of other models. In absolute terms, however,
retrieval similarity between Mistral and the OpenAI models
is only low to moderate. On smaller datasets, the highest
Jaccard similarity to text-embedding-3-large only reaches
about 0.6 (see Figure 5), while on TREC-COVID, the largest
dataset, Jaccard similarity goes down to merely 0.18 (see
Figure 6). For Cohere's model, the most similar model for
top-10 Jaccard similarity is different for each dataset, with
the highest score of 0.51 occurring on ArguAna, as shown in
Figure 5. For all proprietary models, even the best retrieval
similarity at top-10 still suggests that the embeddings that
would be presented to an LLM can differ notably. Once
again, we could also observe dataset-dependent variance in
scores, with lower retrieval similarity on larger datasets.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>While a pair-wise comparison of embeddings using CKA
shows intra- and inter-family model clusters, retrieval
similarity over different k offers a more nuanced picture.
Especially for small k, which is of most interest from a practical
perspective, retrieval similarity varies. When comparing
the top-10 retrieved text chunks, the low Jaccard similarity
scores indicate little overlap in retrieved chunks, even when
CKA scores are high. Especially for the two larger datasets,
FiQA-2018 and TREC-COVID, these scores are extremely
low. As RAG systems usually operate on millions of
embeddings, our datasets are comparatively small. Therefore,
should a general trend of larger datasets leading to lower
retrieval similarity exist, text chunks retrieved by
different models in a regular use case might be nearly disjoint
for small k. Overall, our results suggest that even though
embeddings seem rather similar when compared directly,
retrieval results can still vary substantially, are most
unstable for k values that are commonly used in RAG
systems, and are also dataset-dependent. Retrieved chunks at small
k show the least overlap, leading to large differences in the data
that would be presented to an LLM as additional context.</p>
      <p>Our analysis demonstrates that although models tend
to be most similar to models from their own family,
inter-family clusters exist. The most prominent of these clusters
is formed by the models bge-large-en-v1.5, UAE-Large-V1
and mxbai-embed-large-v1, which demonstrate high
similarity even for retrieval at low k. Nevertheless, the high
variance of retrieval similarity of the remaining clusters
for small k means that while the identified clusters might
provide some measure of orientation when choosing an
embedding model, the choice still remains a non-trivial task.
Identifying suitable alternatives to proprietary models is
likewise not as simple. While we were able to determine
SFR-Embedding-Mistral as the model most similar to
OpenAI's embedding models, Jaccard similarity at top-10
for larger datasets shows a low overlap in retrieved text
chunks. Furthermore, for Cohere's embedding model, we
were unable to find a single most similar model, as this
varied across datasets for small k values.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper we evaluated the similarity of embedding
models on different datasets. Given the large number of available
models, identifying clusters or families of models with
similar embeddings can simplify the model selection process.
While previous work on LLM similarity exists, to the best
of the authors' knowledge, it so far lacks a clear focus on
embedding models specifically in the context of RAG. We
therefore analyzed the similarity of embeddings generated
by 19 different models using CKA for pairwise comparison,
as well as Jaccard and rank similarity to compare retrieval
behavior at top-k, across five datasets. Comparing
embeddings with CKA generally showed intra- and inter-family
clusters across datasets. These clusters also appeared when
evaluating top-k retrieval similarity with large k values.
However, scores for low k values, which would commonly
be chosen in RAG systems, show high variance and much
lower similarity, especially on larger datasets. Although we
were able to identify some model clusters, our results
suggest that choosing the optimal model remains a non-trivial
task that requires careful consideration.</p>
      <p>Still, we argue that a better understanding of how
similarly different embedding models behave is an important
research topic that requires further attention. There are a
plethora of real-world scenarios where RAG systems can
potentially be deployed. One such scenario, for example,
is to retrieve relevant web content in response to a query.
As large corpora of such data are available in the form of
Web ARChive (WARC) files, evaluating embedding model
similarity on such large, uncleaned datasets would perhaps
lead to a better estimation of model similarity for a realistic
RAG use case. Additionally, as documents often need to
be chunked into smaller parts to fit into the models, the
effect of chunking strategies such as token-based or
semantic chunking on embedding similarity could be explored.
Furthermore, our evaluation focused on a small sample of
similarity measures, each with its own definition of which
criteria make models similar. Introducing more measures
with different perspectives could improve our
understanding of which factors influence model similarity. Finally,
our focus was on identifying clusters or families of models,
which for our initial experiments led us to choose families
of embedding models with varying performance on MTEB.
With the frequent appearance of new models on the
leaderboard and the focus on good MTEB performance, it would
be of interest to compare the best performing models on
MTEB and check if their relative difference in performance
correlates with how similar these models are.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has received funding from the European Union’s
Horizon Europe research and innovation program under
grant agreement No 101070014 (OpenWebSearch.EU, https:
//doi.org/10.3030/101070014).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alghisi</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Is your llm outdated? benchmarking llms &amp; alignment algorithms for time-sensitive knowledge</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2404</volume>
          .
          <fpage>08700</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          , Mteb: Massive text embedding benchmark,
          <year>2023</year>
          . arXiv:
          <volume>2210</volume>
          .
          <fpage>07316</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klabunde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schumacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strohmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lemmerich</surname>
          </string-name>
          ,
          <article-title>Similarity of neural network models: A survey of functional and representational measures</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06329</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klabunde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Amor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lemmerich</surname>
          </string-name>
          ,
          <article-title>Towards measuring representational similarity of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.02730</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <article-title>Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Morcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Insights on representational similarity in neural networks with canonical correlation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hardoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Szedmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <article-title>Canonical correlation analysis: An overview with application to learning methods</article-title>
          ,
          <source>Neural computation 16</source>
          (
          <year>2004</year>
          )
          <fpage>2639</fpage>
          -
          <lpage>2664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Denain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Grounding representation similarity through statistical testing</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>1556</fpage>
          -
          <lpage>1568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zullich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pellegrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Medvet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ansuini</surname>
          </string-name>
          , et al.,
          <article-title>On the similarity between hidden layers of pruned and unpruned convolutional neural networks</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods</source>
          , Scitepress,
          <year>2020</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>Inner product-based neural network similarity</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <article-title>Convergent learning: Do different neural networks learn the same representations</article-title>
          ?,
          <year>2016</year>
          . arXiv:
          <volume>1511</volume>
          .
          <fpage>07543</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <article-title>Convergent learning: Do different neural networks learn the same representations?</article-title>
          , in: D.
          <string-name>
            <surname>Storcheus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rostamizadeh</surname>
          </string-name>
          , S. Kumar (Eds.),
          <source>Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS</source>
          <year>2015</year>
          , volume
          <volume>44</volume>
          <source>of Proceedings of Machine Learning Research</source>
          , PMLR, Montreal, Canada,
          <year>2015</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>212</lpage>
          . URL: https://proceedings.mlr.press/v44/li15convergent.html.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Similarity of neural network representations revisited</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3519</fpage>
          -
          <lpage>3529</lpage>
          . URL: https://proceedings.mlr.press/v97/kornblith19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakkiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Barak</surname>
          </string-name>
          ,
          <article-title>Revisiting model stitching to compare neural representations</article-title>
          , in: M.
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Dauphin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          ,
          Curran Associates, Inc.,
          <year>2021</year>
          , pp.
          <fpage>225</fpage>
          -
          <lpage>236</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/01ded4259d101feb739b06c399e9cd9c-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lenc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <article-title>Understanding image representations by measuring their equivariance and equivalence</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>991</fpage>
          -
          <lpage>999</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Balogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jelasity</surname>
          </string-name>
          ,
          <article-title>On the functional similarity of robust and non-robust neural representations</article-title>
          , in: A.
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Brunskill</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Engelhardt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sabato</surname>
          </string-name>
          , J. Scarlett (Eds.),
          <source>Proceedings of the 40th International Conference on Machine Learning</source>
          , volume
          <volume>202</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1614</fpage>
          -
          <lpage>1635</lpage>
          . URL: https://proceedings.mlr.press/v202/balogh23a.html.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Milani Fard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cormier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Canini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Launch and iterate: Reducing prediction churn</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , L. Ma,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>DiffChaser: Detecting disagreements for deep neural networks</article-title>
          ,
          <source>International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Y. Liu,
          <article-title>ModelDiff: testing-based DNN similarity comparison for model reuse detection</article-title>
          ,
          <source>in: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis</source>
          ,
          <source>ISSTA '21</source>
          ,
          ACM
          ,
          <year>2021</year>
          . URL: http://dx.doi.org/10.1145/3460319.3464816. doi:10.1145/3460319.3464816.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Durrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Similarity analysis of contextual word representation models</article-title>
          ,
          <year>2020</year>
          . arXiv:2005.01172.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freestone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K. K.</given-names>
            <surname>Santu</surname>
          </string-name>
          ,
          <article-title>Word embeddings revisited: Do LLMs offer something new?</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.11094.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Konz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kvinge</surname>
          </string-name>
          ,
          <article-title>Understanding the inner workings of language models through representation dissimilarity</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.14993.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <year>2021</year>
          . arXiv:2104.08663.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Finardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Avila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Castaldoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gengo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Caridá</surname>
          </string-name>
          ,
          <article-title>The chronicles of rag: The retriever, the chunk and the generator</article-title>
          ,
          <year>2024</year>
          . arXiv:2401.07883.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sawarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Solanki</surname>
          </string-name>
          ,
          <article-title>Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers</article-title>
          ,
          <year>2024</year>
          . arXiv:2404.07220.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Es</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <article-title>Ragas: Automated evaluation of retrieval augmented generation</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.15217.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <article-title>Measuring statistical dependence with hilbert-schmidt norms</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. U.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tomita</surname>
          </string-name>
          (Eds.),
          <source>Algorithmic Learning Theory</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2005</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <article-title>Towards understanding the instability of network embedding</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>927</fpage>
          -
          <lpage>941</lpage>
          . doi:10.1109/TKDE.2020.2989512.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Chroma Inc.</surname>
          </string-name>
          ,
          <article-title>Chroma</article-title>
          ,
          <source>Chroma Homepage</source>
          ,
          <year>2024</year>
          . URL: https://docs.trychroma.com/.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Text embeddings by weakly-supervised contrastive pre-training</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03533</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models</article-title>
          ,
          <year>2021</year>
          . arXiv:2108.08877.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Large dual encoders are generalizable retrievers</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.07899.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <article-title>C-Pack: Packaged resources to advance general Chinese embedding</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.07597.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Towards general text embeddings with multi-stage contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2308.03281</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>New embedding models with lower pricing</article-title>
          ,
          <source>OpenAI Blog</source>
          ,
          <year>2024</year>
          . URL: https://openai.com/blog/new-embedding-models-and-api-updates.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Cohere</surname>
          </string-name>
          ,
          <article-title>Embeddings - text embeddings with advanced language models</article-title>
          ,
          <source>Cohere Homepage</source>
          ,
          <year>2024</year>
          . URL: https://cohere.com/embeddings.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shakir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koenig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lipp</surname>
          </string-name>
          ,
          <article-title>Open source strikes bread - new fluffy embeddings model</article-title>
          ,
          <year>2024</year>
          . URL: https://www.mixedbread.ai/blog/mxbai-embed-large-v1.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>AnglE-optimized text embeddings</article-title>
          ,
          <source>arXiv preprint arXiv:2309.12871</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yavuz</surname>
          </string-name>
          ,
          <article-title>SFR-Embedding-Mistral: Enhance text retrieval with transfer learning</article-title>
          ,
          <source>Salesforce AI Research Blog</source>
          ,
          <year>2024</year>
          . URL: https://blog.salesforceairesearch.com/sfr-embedded-mistral/.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Waskom</surname>
          </string-name>
          ,
          <article-title>seaborn: statistical data visualization</article-title>
          ,
          <source>Journal of Open Source Software</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>3021</fpage>
          . URL: https://doi.org/10.21105/joss.03021. doi:10.21105/joss.03021.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>