=Paper=
{{Paper
|id=Vol-3784/short4
|storemode=property
|title=Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems
|pdfUrl=https://ceur-ws.org/Vol-3784/short4.pdf
|volume=Vol-3784
|authors=Laura Caspari,Kanishka Ghosh Dastidar,Saber Zerhoudi,Jelena Mitrovic,Michael Granitzer
|dblpUrl=https://dblp.org/rec/conf/ir-rag/CaspariDZMG24
}}
==Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems==
Laura Caspari1,*, Kanishka Ghosh Dastidar1, Saber Zerhoudi1, Jelena Mitrovic1 and Michael Granitzer1
1 University of Passau, Passau, Germany
Abstract
The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer
volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark
performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding
models within the context of RAG systems. Our assessment is two-fold: We use Centered Kernel Alignment to compare embeddings
on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between
these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across
five datasets from the popular Benchmark Information Retrieval (BEIR). Through our experiments we identify clusters of models
corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity
reveals high variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting
the highest similarity to OpenAI models.
Keywords
Large language model, Retrieval-augmented generation, Model similarity
1. Motivation

Retrieval-Augmented Generation (RAG) is an emerging paradigm that helps mitigate the problems of factual hallucination [1] and outdated training data [2] of large language models (LLMs) by providing these models with access to an external, non-parametric knowledge source (e.g. a document corpus). Central to the functioning of RAG frameworks is the retrieval step, wherein a small subset of candidate documents is retrieved from the document corpus, specific to the input query or prompt. This retrieval process, known as dense retrieval, hinges on text embeddings. Typically, the generation of these embeddings is assigned to an LLM, for which there are several options due to the rapid evolution of the field. Consequently, selecting the most suitable embedding model from an array of available choices emerges as a critical aspect in the development of RAG systems. The information to guide this choice is currently primarily limited to architectural details (which are also on occasion scarce due to the prevalence of closed models) and performance benchmarks such as the Massive Text Embedding Benchmark (MTEB) [3].

We posit that an analysis of the similarity of the embeddings generated by these models would significantly aid this model selection process. Given the large number of candidates and the ever increasing scale of the models, a from-scratch empirical evaluation of the embedding quality of these LLMs on a particular task can incur significant costs. This challenge becomes especially pronounced when dealing with large-scale corpora comprising potentially millions of documents. While the relative performance scores of these models on benchmark datasets offer the simplified perspective of comparing a single scalar value on an array of downstream tasks, such a view of model similarity might overlook the nuances of the relative behaviour of the models [4]. As an example, the absolute difference in precision@k between two retrieval systems only provides a weak indication of the overlap of retrieved results. We argue that identifying clusters of models with similar behaviour would allow practitioners to construct smaller, yet diverse candidate pools of models to evaluate. Beyond model selection, as highlighted by Klabunde et al. [5], such an analysis also facilitates the identification of common factors contributing to strong performance, easier model ensembling, and detection of potential instances of unauthorized model reuse.

In this paper, we analyze different LLMs in terms of the similarities of the embeddings they generate. Our similarity analysis serves as an unsupervised evaluation framework for these embedding models, in contrast to performance benchmarks that require labelled data. We do this from a dual perspective: we directly compare the embeddings using representational similarity measures. Additionally, we evaluate model similarity specifically in terms of their functional impact on RAG systems, i.e. we look at how similar the retrieved results are. Our evaluation focuses on several prominent model families, to analyze similarities both within and across them. We also compare proprietary models (such as those by OpenAI or Cohere) to open-sourced ones in order to identify the most similar alternatives. Our experiments are carried out on five popular benchmark datasets to determine if similarities between models are influenced by the choice of data. Our code is available at https://github.com/casparil/embedding-model-similarity.

IR-RAG@SIGIR'24: ACM SIGIR Workshop on Information Retrieval's Role in RAG Systems, July 18, 2024, Washington D.C., USA
* Corresponding author.
laura.caspari@uni-passau.de (L. Caspari); kanishka.ghoshdastidar@uni-passau.de (K. G. Dastidar); saber.zerhoudi@uni-passau.de (S. Zerhoudi); jelena.mitrovic@uni-passau.de (J. Mitrovic); michael.granitzer@uni-passau.de (M. Granitzer)
ORCID: 0009-0002-6670-3211 (L. Caspari); 0000-0003-4171-0597 (K. G. Dastidar); 0000-0003-2259-0462 (S. Zerhoudi); 0000-0003-3220-8749 (J. Mitrovic); 0000-0003-3566-5507 (M. Granitzer)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
2. Related Work

Studies evaluating similarities of neural networks fall into two main categories: the first involves comparing activations of different models generated at any pair of layers for a specific input (representational similarity), while the second compares the model outputs (functional similarity). Raghu et al. [6] and Morcos et al. [7] propose measures building on Canonical Correlation Analysis (CCA) [8], a statistical technique used to find the linear relationship between two sets of variables by maximizing their correlation. Such comparisons using CCA or variants thereof can be found in several works [9], [10], [11]. Beyond CCA-based measures, other works have also explored computing correlations [12] and the mutual information [13] between neurons across networks. Kornblith et al. [14] propose Centered Kernel Alignment (CKA), which they show improves over several similarity measures in identifying corresponding layers of identical networks with different initializations. A diverse range of functional similarity evaluations have also been explored in the literature. A few examples include model-stitching [15], [16], [17], disagreement measures between output classes [18], [19], and quantifying the similarity between the class-wise output probabilities [20]. We would point the reader to the survey by Klabunde et al. [4] for a detailed overview of representational and functional similarity measures.

Recently, a few works have also focused on specifically evaluating the similarity of LLMs. While Wu et al. [21] evaluate language models along several perspectives, such as their representational and neuron-level similarities, their evaluation pre-dates the introduction of the recent wave of large scale models. Freestone and Santu [22] consider similarities of word embeddings, and evaluate if LLMs differ significantly from classical encoding models in terms of their representations. The works by Klabunde et al. [5] and Brown et al. [23] are more recent, and evaluate the representational similarity of LLMs, with the latter also considering the similarities between models of different sizes in the same model family.

Much of the literature on evaluation of LLM embeddings focuses on their performance on downstream tasks, with benchmarks such as BEIR [24] (for retrieval specifically) and MTEB [3] providing a unified view of embedding quality across metrics and datasets. The metrics used here mostly include typical information retrieval metrics such as precision, recall, and mean reciprocal rank at certain cutoffs. Some works specifically evaluate the retrieval components in a RAG context, where they either use a dataset outside of those included in the benchmarks [25] or where the evaluation encompasses other aspects of the retriever beyond the embedding model being used [26]. Another approach, that does not rely on ground-truth labels, is given by the Retrieval Augmented Generation Assessment (RAGAS) framework, which uses an LLM to determine the ratio of sentences in the retrieved context that are relevant to the answer being generated [27]. To the best of our knowledge, there are no works that evaluate the similarity of embedding models from a retrieval perspective.

Table 1
The datasets used for generating embeddings with their number of queries and corpus size.

Dataset Name Queries Corpus
TREC-COVID 50 171k
NFCorpus 323 3.6k
FiQA-2018 648 57k
ArguAna 1406 8.67k
SciFact 300 5k

3. Methods

We evaluate embedding model similarity using two approaches. The first directly compares the embeddings of text chunks generated by the models. The second approach is specific to the RAG context, where we evaluate the similarity of retrieved results for a given query. These approaches are discussed in detail in the following sections.

3.1. Pair-wise Embedding Similarity

There are several metrics defined in the literature that measure representational similarity [4]. Many of these metrics require the representation spaces of the embeddings to be compared to be aligned and/or the dimensionality of the embeddings across the models to be identical. To avoid these constraints, we pick Centered Kernel Alignment (CKA) [14] with a linear kernel as our similarity measure.

The measure computes similarity between two sets of embeddings in two steps. First, for a set of embeddings, the pair-wise similarity scores between all entries within this set are computed using the kernel function. Thus, row k of the resulting similarity matrix contains entries representing the similarity between embedding k and all other embeddings, including itself. Computing two such embedding similarity matrices for different models with the same number of embeddings then leads to two matrices E and E' of matching dimensions. These are compared directly in the second step with the Hilbert-Schmidt Independence Criterion (HSIC) [28] using the following formula:

$\mathrm{CKA}(E, E') = \dfrac{\mathrm{HSIC}(E, E')}{\sqrt{\mathrm{HSIC}(E, E)\,\mathrm{HSIC}(E', E')}}$

The resulting similarity scores are bounded in the interval [0, 1], with a score of 1 indicating equivalent representations. CKA assumes that representations are mean-centered.
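To make this concrete, the following is a minimal sketch of how linear CKA can be computed for two matched embedding matrices with numpy; the function and variable names are our own illustration and are not taken from the released code.

```python
import numpy as np

def linear_cka(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Linear CKA between two embedding matrices of shape (n_chunks, dim).

    Both matrices must describe the same n chunks (matched, e.g., by document
    and chunk id), but their embedding dimensionalities may differ.
    """
    # Mean-center each representation, as CKA assumes centered features.
    emb_a = emb_a - emb_a.mean(axis=0, keepdims=True)
    emb_b = emb_b - emb_b.mean(axis=0, keepdims=True)

    # With a linear kernel, HSIC reduces (up to constants that cancel) to the
    # squared Frobenius norm of the cross-covariance: ||B^T A||_F^2.
    hsic_ab = np.linalg.norm(emb_b.T @ emb_a, ord="fro") ** 2
    hsic_aa = np.linalg.norm(emb_a.T @ emb_a, ord="fro") ** 2
    hsic_bb = np.linalg.norm(emb_b.T @ emb_b, ord="fro") ** 2

    return hsic_ab / np.sqrt(hsic_aa * hsic_bb)

# Example with random data: 100 matched chunks, two hypothetical models.
rng = np.random.default_rng(0)
print(linear_cka(rng.normal(size=(100, 768)), rng.normal(size=(100, 1024))))
```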
3.2. Retrieval Similarity

While a pair-wise comparison of embeddings offers insights into the similarities of the representations learned by these models, it does not suffice to quantify the similarities in outcomes when these embedding models are deployed for specific tasks. Therefore, in the context of RAG systems, we consider the similarity of retrieved text chunks for a given query, when different embedding models are used. As a first step, for a given dataset, we generate embeddings of queries and document chunks with each of the embedding models. We then retrieve the k most similar embeddings in terms of the cosine similarity for a particular query. As these embeddings correspond to specific chunks of text, we derive the sets of retrieved chunks C and C' for a pair of models. To measure the similarity of these sets, we use the Jaccard similarity coefficient as follows:

$\mathrm{Jaccard}(C, C') = \dfrac{|C \cap C'|}{|C \cup C'|}$

Here, |C ∩ C'| corresponds to the overlap in text chunks by counting how often the two models retrieved the same chunks. Similarly, we can compute the union |C ∪ C'|, which corresponds to all retrieved text chunks, counting chunks present in both sets only once. The resulting score is bounded in the interval [0, 1], with 1 indicating that both models retrieved the same set of text chunks.
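A minimal sketch of this overlap computation, assuming each retrieved chunk is identified by a (document id, chunk id) pair as described above; the helper name and the example ids are hypothetical:

```python
def jaccard_similarity(chunks_a, chunks_b) -> float:
    """Jaccard overlap between two collections of retrieved chunk identifiers."""
    set_a, set_b = set(chunks_a), set(chunks_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)

# Top-5 chunks retrieved by two hypothetical models for the same query.
retrieved_a = [("doc1", 0), ("doc1", 3), ("doc2", 1), ("doc7", 2), ("doc9", 0)]
retrieved_b = [("doc1", 3), ("doc2", 1), ("doc4", 5), ("doc7", 2), ("doc9", 4)]
print(jaccard_similarity(retrieved_a, retrieved_b))  # 3 shared of 7 distinct chunks, roughly 0.43
```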
While Jaccard similarity computes the percentage to which two sets overlap, it ignores the order in the sets. Rank similarity [29], on the other hand, considers the order of common elements, with closer elements having a higher impact on the score. The measure assigns ranks to common text chunks according to their similarity to the query, i.e. $r_C(c) = k$ if chunk $c$ was the top-$k$ retrieved result for the query. Ranks are then compared using:

$\mathrm{sim}(r_C(c), r_{C'}(c)) = \dfrac{2}{(1 + |r_C(c) - r_{C'}(c)|)\,(r_C(c) + r_{C'}(c))}$

With this, rank similarity for two sets of retrieved text chunks C, C' is calculated as:

$\mathrm{ranksim}(C, C') = \dfrac{1}{H(|C \cap C'|)} \sum_{c \in C \cap C'} \mathrm{sim}(r_C(c), r_{C'}(c))$

with $H(|C \cap C'|) = \sum_{k=1}^{K=|C \cap C'|} \frac{1}{k}$ denoting the K-th harmonic number, normalizing the score. Like the other measures, rank similarity is bounded in the interval [0, 1], with 1 indicating that all ranks are identical.
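The sketch below mirrors these two formulas, under the assumption that the rank of a chunk is its 1-based position in the ranked result list; the function name is illustrative and not part of the released code:

```python
def rank_similarity(ranked_a, ranked_b) -> float:
    """Rank similarity of two ranked lists of retrieved chunk identifiers."""
    # 1-based rank of each chunk in its list: r_C(c) = k for the top-k result.
    ranks_a = {chunk: i + 1 for i, chunk in enumerate(ranked_a)}
    ranks_b = {chunk: i + 1 for i, chunk in enumerate(ranked_b)}

    common = set(ranks_a) & set(ranks_b)
    if not common:
        return 0.0

    # Pair-wise rank agreement for every shared chunk.
    total = 0.0
    for chunk in common:
        ra, rb = ranks_a[chunk], ranks_b[chunk]
        total += 2.0 / ((1 + abs(ra - rb)) * (ra + rb))

    # Normalize by the K-th harmonic number, K = |C intersection C'|, so that
    # two identical rankings yield a score of exactly 1.
    harmonic = sum(1.0 / k for k in range(1, len(common) + 1))
    return total / harmonic
```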
Table 2
We compare a diverse set of open source models from different families as well as proprietary models with varying performance
on MTEB.
Model Embedding dimension Max. Tokens MTEB Average Open Source
SFR-Embedding-Mistral 4096 32768 67.56 ✓
mxbai-embed-large-v1 1024 512 64.68 ✓
UAE-Large-V1 1024 512 64.64 ✓
text-embedding-3-large 3072 8191 64.59 ✗
Cohere embed-english-v3.0 1024 512 64.47 ✗
bge-large-en-v1.5 1024 512 64.23 ✓
bge-base-en-v1.5 768 512 63.55 ✓
gte-large 1024 512 63.13 ✓
gte-base 768 512 62.39 ✓
text-embedding-3-small 1536 8191 62.26 ✗
e5-large-v2 1024 512 62.25 ✓
bge-small-en-v1.5 384 512 62.17 ✓
e5-base-v2 768 512 61.5 ✓
gte-small 384 512 61.36 ✓
e5-small-v2 384 512 59.93 ✓
gtr-t5-large 768 512 58.28 ✓
sentence-t5-large 768 512 57.06 ✓
gtr-t5-base 768 512 56.19 ✓
sentence-t5-base 768 512 55.27 ✓
4. Experimental Setup

The following paragraphs describe our choice of datasets and models, along with details of the implementation of our experiments.

As we focus on the retrieval component of RAG systems, we select five publicly available datasets from the BEIR benchmark [24]. As generating embeddings for large datasets is a time-intensive process, especially for a larger number of models, we opt for five of the smaller datasets from the benchmark. This approach allows us to compare embeddings generated by a variety of models while at the same time allowing us to evaluate embedding similarity across datasets. An overview of the datasets is shown in Table 1. For each dataset, we create embeddings by splitting documents into text chunks such that each chunk contains 256 tokens. The embedding vectors are stored with Chroma DB [30], an open source embedding database. For each vector, we additionally store information about the document and text chunk ids it encodes to be able to match embeddings generated by different models for evaluation.

For model selection, we primarily use publicly available models from the MTEB leaderboard [3]. We do not simply pick the best performing models on the leaderboard; instead, our choices are influenced by several factors. Firstly, we focus on analyzing similarities within and across model families and pick models belonging to the e5 [31], t5 [32, 33], bge [34], and gte [35] families. Secondly, we recognize that it might be of interest to users to avoid pay-by-token policies of proprietary models by identifying similar open-source alternatives. Therefore, we pick high-performing proprietary models, two from OpenAI (text-embedding-3-large and -small) [36] and one from Cohere (Cohere embed-english-v3.0) [37]. We also compare the mxbai-embed-large-v1 (mxbai) [38] and UAE-Large-V1 (UAE) [39] models, which not only report very similar performances on MTEB, but also identical embedding dimensions, model size and memory usage. Finally, we include SFR-Embedding-Mistral (Mistral) [40] as the best-performing model on the leaderboard at the time of our experiments. A detailed overview of all selected models can be seen in Table 2.

To compare embedding similarity across models and datasets, we employ different strategies depending on the similarity measure. We apply CKA by retrieving all embeddings created by a model, matching embeddings using their document and text chunk ids and then computing their similarity for each of the five datasets. For Jaccard and rank similarity, we use sklearn's NearestNeighbors class [41] to determine the top-k retrieval results. We compute Jaccard and rank scores per dataset, averaging over 25 queries. For the NFCorpus dataset, we calculate retrieval similarity for all possible k, i.e. using all embeddings generated for the dataset. As calculating similarity for each possible k is computationally expensive, we did not repeat this for the remaining datasets and chose a smaller k value instead. Furthermore, as only a limited number of results are to be provided as context to the generative model, analyzing retrieval similarity at low k values, e.g. top-10, is of most interest. As we are interested in identifying clusters of similar models, we also perform a hierarchical clustering on heatmap values using Seaborn [42]. The following section describes the results of our evaluation for the different measures.
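As a rough illustration of this comparison step (a sketch under our own assumptions, not the released pipeline, which stores its embeddings in Chroma DB), the snippet below retrieves the top-k chunk indices for a batch of query embeddings with scikit-learn's NearestNeighbors using cosine distance and averages the Jaccard overlap between two models over the queries; all array names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def jaccard_similarity(a, b) -> float:
    """Compact version of the Jaccard helper sketched in Section 3.2."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def topk_indices(chunk_embeddings: np.ndarray, query_embeddings: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most cosine-similar chunks for every query embedding."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(chunk_embeddings)
    _, indices = nn.kneighbors(query_embeddings)
    return indices

def mean_topk_jaccard(chunks_a, queries_a, chunks_b, queries_b, k=10) -> float:
    """Average top-k Jaccard similarity between two models over a set of queries.

    The rows of chunks_a/chunks_b (and of queries_a/queries_b) must describe
    the same text chunks (queries), so that retrieved indices are comparable.
    """
    idx_a = topk_indices(chunks_a, queries_a, k)
    idx_b = topk_indices(chunks_b, queries_b, k)
    return float(np.mean([jaccard_similarity(ra, rb) for ra, rb in zip(idx_a, idx_b)]))
```

A heatmap of such averaged scores over all model pairs could then, for example, be clustered and rendered with Seaborn's clustermap function.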
Figure 1: Mean CKA similarity across all five datasets. Models tend to be most similar to models belonging to their own family, though some interesting inter-family patterns are visible as well.

Figure 2: Rank similarity over all k on NFCorpus, comparing gte-large to all other models. Scores are highest and vary most for small k, but then drop quickly before stabilizing for larger k.
5. Results

To evaluate how similar embeddings generated by different models are, we will first consider model families, checking if their pairwise and top-k similarity scores are highest within their family. Subsequently, we will identify the open source models which are most similar to our chosen proprietary models.

5.1. Intra- and Inter-Family Clusters

Comparing embeddings directly with CKA shows high similarity across most of the models, albeit with some variance. These scores allow us to identify certain clusters of models. Figure 1 shows the pair-wise CKA scores of all models averaged across the five datasets. As expected, scores for most models are highest within their own family. This holds true for the gtr-t5, sentence-t5 and text-embedding-3 (OpenAI) models. Although the sentence-t5 and gtr-t5 models are closely related, they do not exhibit significantly higher similarity with each other compared to the remaining models.

From an inter-family perspective, we observe high similarity between the bge and gte models. For some models in these two families, interestingly, the highest similarity scores rather correspond to inter-family counterparts with matching embedding dimensions than to models in the same family. Specifically, gte-small reports the highest similarity to bge-small and gte-base to bge-base. On the other hand, gte-large shows slightly higher similarity to bge-base than to bge-large and thus to a model with a lower embedding dimension. Another inter-family cluster is formed by the three models with the highest CKA scores overall, namely UAE, mxbai and bge-large, whose scores suggest almost perfect embedding similarity. In fact, the similarity score of bge-large to these two models is much higher than to other bge models.

Shifting our attention to top-k retrieval similarity, clusters vary depending on the k value. Figure 3 illustrates how Jaccard similarity evolves over k on NFCorpus. The first plot displays Jaccard scores between bge-large and all other models, while the second plot illustrates the scores for gte-large. For extremely low k, we observe some peaks for nearly all models, followed by a noticeable drop in similarity. Of course, for larger k, the scores converge to one. Reaffirming our earlier observations with the CKA metric, bge-large demonstrates high retrieval similarity with UAE and mxbai. Similarity to the remaining models is much lower, with the highest scores for bge-base and bge-small for larger k. However, especially for small k, there is high variance in the similarity scores, with models from other families, e.g. Mistral or gte-large, sometimes achieving higher scores than the bge models. A similar pattern can also be observed in the second plot, where Jaccard similarity for gte-large is highest within its family for larger k, but models like mxbai or bge-base sometimes report higher similarity for small k. Therefore, the clusters we identified through our CKA analysis are only truly reflected in these plots for large values of k. This suggests that in real-world use cases, where the top-k results are crucial, such representational similarity measures might not provide the full picture. The plots for other model families provide nearly identical insights as those in the second plot in Figure 3 and thus we do not present them for the sake of brevity.

For rank similarity, scores peak for small k and then quickly start to drop until they approach a low stable score for larger k, as shown in Figure 2 for gte-large. Once again, the bge/UAE/mxbai inter-family cluster shows the highest similarity. In contrast to Jaccard similarity, the clusters that could be observed for CKA do not always show for rank similarity.
Figure 3: Jaccard similarity over all k on NFCorpus, comparing bge-large (a) and gte-large (b) to all other models. While bge-large shows high similarity to UAE-Large-V1 and mxbai-embed-large-v1, scores for gte-large are clustered much closer. Jaccard similarity seems to be most unstable for small values of k, which would commonly be chosen for retrieval tasks.
Figure 4: Jaccard (a) and rank similarity (b) for the top-10 retrieved text chunks averaged over 25 queries on NFCorpus. The clusters vary slightly depending on the measure, as do the scores. Models tend to be most similar to models from their own family. However, some inter-family clusters are visible as well.
As can be seen in Figure 2, the model with the highest rank similarity to gte-large is mxbai, rather than another gte model. Even so, the previously observed clusters also tend to appear for rank similarity, though they vary more depending on the models and dataset. Generally, scores for nearly all models are rather small for larger k, indicating low rank similarity. For small k, results vary more and differences between individual models are more pronounced.

As retrieval similarity at small k is of most interest from a practical perspective, we take a closer look at top-10 Jaccard similarity. The heatmaps in Figures 4-6 show the top-10 Jaccard similarity between models across datasets. A striking insight here is that even the most similar models only report a Jaccard similarity higher than 0.6, with the majority less than 0.5. This is of great relevance to practitioners, as it would imply that even using embeddings from models that report high representational similarity scores may yield little overlap in retrieved text chunks. As earlier, the cluster of UAE/mxbai/bge-large is the most prominent one with the highest scores. Intra-family scores tend to be the highest for most models, i.e. t5 and OpenAI.
Figure 5: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on SciFact (a) and ArguAna (b). The UAE and mxbai models show high levels of similarity along with bge-large. The remaining models tend to show the highest similarity within their own family with the exception of the bge/gte inter-family cluster.
Depending on the dataset, this also applies to the gte and e5 models, although Jaccard similarity to models from other families is sometimes higher. We also note that for the two larger datasets FiQA-2018 and TREC-COVID, the similarity scores are generally substantially lower, as can be seen in Figure 6. For the smaller datasets, Jaccard similarity is generally higher, though results differ depending on the data (see Figures 4 and 5). Similar observations can be made for rank similarity, although the appearance of family clusters is more dependent on the dataset. Larger datasets also lead to lower scores. These results illustrate that while family clusters can still be perceived at small k, they are not as prominent as they are for larger k. Furthermore, the top-10 retrieved results differ substantially for most models and datasets and their similarity might be dependent on the dataset itself.

5.2. Open Source Alternatives to Proprietary Models

We explicitly included proprietary models in our analysis to check which open source models are the best - which in our case means the most similar - alternative. The CKA scores in Figure 1 indicate that embeddings generated by OpenAI's models (text-embedding-3-large/-small) are highly similar to those generated by Mistral, while the Cohere model (embed-english-v3.0) demonstrates high similarity to e5-large-v2. These observations do not entirely extend to retrieval similarity, especially for Cohere. While Mistral is still the most similar model to OpenAI's for larger k across all datasets, there is no consistently most similar model for Cohere. Rather, the model varies depending on the dataset and measure - Jaccard and rank similarity - sometimes showing highest similarity to e5-large-v2, but sometimes also to other models like Mistral. Taking a closer look at top-10 similarity, Mistral still largely exhibits the highest similarity to the OpenAI models, especially to text-embedding-3-large. For text-embedding-3-small, scores on all datasets are rather close to those of other models. In absolute terms, however, retrieval similarity between Mistral and the OpenAI models is only low to moderate. On smaller datasets, the highest Jaccard similarity to text-embedding-3-large only reaches about 0.6 (see Figure 5), while on TREC-COVID, the largest dataset, Jaccard similarity goes down to merely 0.18 (see Figure 6). For Cohere's model, the most similar model for top-10 Jaccard similarity is different for each dataset, with the highest score of 0.51 occurring on ArguAna, as shown in Figure 5. For all proprietary models, even the best retrieval similarity at top-10 still suggests that the embeddings that would be presented to an LLM can differ notably. Once again, we could also observe dataset-dependent variance in scores, with lower retrieval similarity on larger datasets.

6. Discussion

While a pair-wise comparison of embeddings using CKA shows intra- and inter-family model clusters, retrieval similarity over different k offers a more nuanced picture. Especially for small k, which are of most interest from a practical perspective, retrieval similarity varies. When comparing the top-10 retrieved text chunks, the low Jaccard similarity scores indicate little overlap in retrieved chunks, even when CKA scores are high. Especially for the two larger datasets FiQA-2018 and TREC-COVID, these scores are extremely low. As RAG systems usually operate on millions of embeddings, our datasets are comparatively small. Therefore, should a general trend of larger datasets leading to lower retrieval similarity exist, text chunks retrieved by different models in a regular use case might be nearly distinct for small k. Overall, our results suggest that even though embeddings seem rather similar when compared directly, retrieval performance can still vary substantially, is most unstable for k values that are commonly used in RAG systems, and is also dataset-dependent. Retrieved chunks at small k show the least overlap, leading to high differences in the data that would be presented to an LLM as additional context.
Figure 6: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on FiQA-2018 (a) and TREC-COVID (b). Most models seem to retrieve almost completely distinct text chunks. Only the bge/UAE/mxbai cluster still shows a notable level of similarity, while the scores of the remaining clusters indicate only moderate to low levels of similarity.
Our analysis demonstrates that although models tend to be most similar to models from their own family, inter-family clusters exist. The most prominent of these clusters is formed by the models bge-large-en-v1.5, UAE-Large-V1 and mxbai-embed-large-v1, which demonstrate high similarity even for retrieval at low k. Nevertheless, the high variance of retrieval similarity of the remaining clusters for small k means that while the identified clusters might provide some measure of orientation when choosing an embedding model, the choice still remains a non-trivial task. Identifying suitable alternatives to proprietary models is likewise not as simple. While we were able to determine SFR-Embedding-Mistral as the model most similar to OpenAI's embedding models, Jaccard similarity at top-10 for larger datasets shows a low overlap in retrieved text chunks. Furthermore, for Cohere's embedding model, we were unable to find a single most similar model, as this model varied across datasets for small k values.

7. Conclusion

In this paper we evaluated the similarity of embedding models on different datasets. Given the large number of available models, identifying clusters or families of models with similar embeddings can simplify the model selection process. While previous work on LLM similarity exists, to the best of the authors' knowledge, it so far lacks a clear focus on embedding models specifically in the context of RAG. We therefore analyzed the similarity of embeddings generated by 19 different models using CKA for pairwise comparison as well as Jaccard and rank similarity to compare retrieval behavior at top-k across five datasets. Comparing embeddings with CKA generally showed intra- and inter-family clusters across datasets. These clusters also appeared when evaluating top-k retrieval similarity with large k values. However, scores for low k values, which would commonly be chosen in RAG systems, show high variance and much lower similarity, especially on larger datasets. Although we were able to identify some model clusters, our results suggest that choosing the optimal model remains a non-trivial task that requires careful consideration.

Still, we argue that a better understanding of how similarly different embedding models behave is an important research topic that requires further attention. There are a plethora of real-world scenarios where RAG systems can potentially be deployed. One such scenario, for example, is to retrieve relevant web content in response to a query. As large corpora of such data are available in the form of Web ARChive (WARC) files, evaluating embedding model similarity on such large, uncleaned datasets would perhaps lead to a better estimation of model similarity for a realistic RAG use case. Additionally, as documents often need to be chunked into smaller parts to fit into the models, the effect of chunking strategies such as token-based or semantic chunking on embedding similarity could be explored. Furthermore, our evaluation focused on a small sample of similarity measures, each with its own definition of which criteria make models similar. Introducing more measures with different perspectives could improve our understanding of which factors influence model similarity. Finally, our focus was on identifying clusters or families of models, which for our initial experiments led us to choosing families of embedding models with varying performance on MTEB. With the frequent appearance of new models on the leaderboard and the focus on good MTEB performance, it would be of interest to compare the best performing models on MTEB and check if their relative difference in performance correlates with how similar these models are.

Acknowledgments

This work has received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).
References

[1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38.
[2] S. M. Mousavi, S. Alghisi, G. Riccardi, Is your llm outdated? benchmarking llms & alignment algorithms for time-sensitive knowledge, 2024. arXiv:2404.08700.
[3] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, Mteb: Massive text embedding benchmark, 2023. arXiv:2210.07316.
[4] M. Klabunde, T. Schumacher, M. Strohmaier, F. Lemmerich, Similarity of neural network models: A survey of functional and representational measures, arXiv preprint arXiv:2305.06329 (2023).
[5] M. Klabunde, M. B. Amor, M. Granitzer, F. Lemmerich, Towards measuring representational similarity of large language models, arXiv preprint arXiv:2312.02730 (2023).
[6] M. Raghu, J. Gilmer, J. Yosinski, J. Sohl-Dickstein, Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability, Advances in neural information processing systems 30 (2017).
[7] A. Morcos, M. Raghu, S. Bengio, Insights on representational similarity in neural networks with canonical correlation, Advances in neural information processing systems 31 (2018).
[8] D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural computation 16 (2004) 2639–2664.
[9] F. Ding, J.-S. Denain, J. Steinhardt, Grounding representation similarity through statistical testing, Advances in Neural Information Processing Systems 34 (2021) 1556–1568.
[10] M. Zullich, F. Pellegrino, E. Medvet, A. Ansuini, et al., On the similarity between hidden layers of pruned and unpruned convolutional neural networks, in: Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods, Scitepress, 2020, pp. 52–59.
[11] W. Chen, Z. Miao, Q. Qiu, Inner product-based neural network similarity, Advances in Neural Information Processing Systems 36 (2024).
[12] Y. Li, J. Yosinski, J. Clune, H. Lipson, J. Hopcroft, Convergent learning: Do different neural networks learn the same representations?, 2016. arXiv:1511.07543.
[13] Y. Li, J. Yosinski, J. Clune, H. Lipson, J. Hopcroft, Convergent learning: Do different neural networks learn the same representations?, in: D. Storcheus, A. Rostamizadeh, S. Kumar (Eds.), Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, volume 44 of Proceedings of Machine Learning Research, PMLR, Montreal, Canada, 2015, pp. 196–212. URL: https://proceedings.mlr.press/v44/li15convergent.html.
[14] S. Kornblith, M. Norouzi, H. Lee, G. Hinton, Similarity of neural network representations revisited, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 3519–3529. URL: https://proceedings.mlr.press/v97/kornblith19a.html.
[15] Y. Bansal, P. Nakkiran, B. Barak, Revisiting model stitching to compare neural representations, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, J. W. Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, Curran Associates, Inc., 2021, pp. 225–236. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/01ded4259d101feb739b06c399e9cd9c-Paper.pdf.
[16] K. Lenc, A. Vedaldi, Understanding image representations by measuring their equivariance and equivalence, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 991–999.
[17] A. Balogh, M. Jelasity, On the functional similarity of robust and non-robust neural representations, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 1614–1635. URL: https://proceedings.mlr.press/v202/balogh23a.html.
[18] M. Milani Fard, Q. Cormier, K. Canini, M. Gupta, Launch and iterate: Reducing prediction churn, Advances in Neural Information Processing Systems 29 (2016).
[19] X. Xie, L. Ma, H. Wang, Y. Li, Y. Liu, X. Li, Diffchaser: Detecting disagreements for deep neural networks, International Joint Conferences on Artificial Intelligence Organization, 2019.
[20] Y. Li, Z. Zhang, B. Liu, Z. Yang, Y. Liu, Modeldiff: testing-based dnn similarity comparison for model reuse detection, in: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA '21, ACM, 2021. URL: http://dx.doi.org/10.1145/3460319.3464816. doi:10.1145/3460319.3464816.
[21] J. M. Wu, Y. Belinkov, H. Sajjad, N. Durrani, F. Dalvi, J. Glass, Similarity analysis of contextual word representation models, 2020. arXiv:2005.01172.
[22] M. Freestone, S. K. K. Santu, Word embeddings revisited: Do llms offer something new?, 2024. arXiv:2402.11094.
[23] D. Brown, C. Godfrey, N. Konz, J. Tu, H. Kvinge, Understanding the inner workings of language models through representation dissimilarity, 2023. arXiv:2310.14993.
[24] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models, 2021. arXiv:2104.08663.
[25] P. Finardi, L. Avila, R. Castaldoni, P. Gengo, C. Larcher, M. Piau, P. Costa, V. Caridá, The chronicles of rag: The retriever, the chunk and the generator, 2024. arXiv:2401.07883.
[26] K. Sawarkar, A. Mangal, S. R. Solanki, Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers, 2024. arXiv:2404.07220.
[27] S. Es, J. James, L. Espinosa-Anke, S. Schockaert, Ragas: Automated evaluation of retrieval augmented generation, 2023. arXiv:2309.15217.
[28] A. Gretton, O. Bousquet, A. Smola, B. Schölkopf, Measuring statistical dependence with hilbert-schmidt norms, in: S. Jain, H. U. Simon, E. Tomita (Eds.), Algorithmic Learning Theory, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 63–77.
[29] C. Wang, W. Rao, W. Guo, P. Wang, J. Liu, X. Guan, Towards understanding the instability of network embedding, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 927–941. doi:10.1109/TKDE.2020.2989512.
[30] C. Inc., Chroma, Chroma Homepage, 2024. URL: https://docs.trychroma.com/.
[31] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text embeddings by weakly-supervised contrastive pre-training, arXiv preprint arXiv:2212.03533 (2022).
[32] J. Ni, G. H. Ábrego, N. Constant, J. Ma, K. B. Hall, D. Cer, Y. Yang, Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models, 2021. arXiv:2108.08877.
[33] J. Ni, C. Qu, J. Lu, Z. Dai, G. H. Ábrego, J. Ma, V. Y. Zhao, Y. Luan, K. B. Hall, M.-W. Chang, Y. Yang, Large dual encoders are generalizable retrievers, 2021. arXiv:2112.07899.
[34] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, C-pack: Packaged resources to advance general chinese embedding, 2023. arXiv:2309.07597.
[35] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards general text embeddings with multi-stage contrastive learning, arXiv preprint arXiv:2308.03281 (2023).
[36] OpenAI, New embedding models with lower pricing, OpenAI Blog, 2024. URL: https://openai.com/blog/new-embedding-models-and-api-updates.
[37] Cohere, Embeddings - text embeddings with advanced language models, Cohere Homepage, 2024. URL: https://cohere.com/embeddings.
[38] S. Lee, A. Shakir, D. Koenig, J. Lipp, Open source strikes bread - new fluffy embeddings model, 2024. URL: https://www.mixedbread.ai/blog/mxbai-embed-large-v1.
[39] X. Li, J. Li, Angle-optimized text embeddings, arXiv preprint arXiv:2309.12871 (2023).
[40] R. Meng, Y. Liu, S. R. Joty, C. Xiong, Y. Zhou, S. Yavuz, Sfr-embedding-mistral: enhance text retrieval with transfer learning, Salesforce AI Research Blog, 2024. URL: https://blog.salesforceairesearch.com/sfr-embedded-mistral/.
[41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[42] M. L. Waskom, seaborn: statistical data visualization, Journal of Open Source Software 6 (2021) 3021. URL: https://doi.org/10.21105/joss.03021. doi:10.21105/joss.03021.