CEUR-WS Vol-3784, short4: https://ceur-ws.org/Vol-3784/short4.pdf
                         Beyond Benchmarks: Evaluating Embedding Model Similarity
                         for Retrieval Augmented Generation Systems
                         Laura Caspari1,* , Kanishka Ghosh Dastidar1 , Saber Zerhoudi1 , Jelena Mitrovic1 and Michael Granitzer1
1 University of Passau, Passau, Germany


Abstract
The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying on benchmark performance scores alone allows only a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: we use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmarking Information Retrieval (BEIR) benchmark. Through our experiments, we identify clusters of models corresponding to model families, but, interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.

Keywords
Large language model, Retrieval-augmented generation, Model similarity



1. Motivation

Retrieval-Augmented Generation (RAG) is an emerging paradigm that helps mitigate the problems of factual hallucination [1] and outdated training data [2] of large language models (LLMs) by providing these models with access to an external, non-parametric knowledge source (e.g. a document corpus). Central to the functioning of RAG frameworks is the retrieval step, wherein a small subset of candidate documents is retrieved from the document corpus, specific to the input query or prompt. This retrieval process, known as dense retrieval, hinges on text embeddings. Typically, the generation of these embeddings is assigned to an LLM, for which there are several options due to the rapid evolution of the field. Consequently, selecting the most suitable embedding model from an array of available choices emerges as a critical aspect in the development of RAG systems. The information to guide this choice is currently limited primarily to architectural details (which are also on occasion scarce due to the prevalence of closed models) and performance benchmarks such as the Massive Text Embedding Benchmark (MTEB) [3].

We posit that an analysis of the similarity of the embeddings generated by these models would significantly aid this model selection process. Given the large number of candidates and the ever-increasing scale of the models, a from-scratch empirical evaluation of the embedding quality of these LLMs on a particular task can incur significant costs. This challenge becomes especially pronounced when dealing with large-scale corpora comprising potentially millions of documents. While the relative performance scores of these models on benchmark datasets offer the simplified perspective of comparing a single scalar value on an array of downstream tasks, such a view of model similarity might overlook the nuances of the relative behaviour of the models [4]. As an example, the absolute difference in precision@k between two retrieval systems only provides a weak indication of the overlap of retrieved results. We argue that identifying clusters of models with similar behaviour would allow practitioners to construct smaller, yet diverse candidate pools of models to evaluate. Beyond model selection, as highlighted by Klabunde et al. [5], such an analysis also facilitates the identification of common factors contributing to strong performance, easier model ensembling, and detection of potential instances of unauthorized model reuse.

In this paper, we analyze different LLMs in terms of the similarities of the embeddings they generate. Our similarity analysis serves as an unsupervised evaluation framework for these embedding models, in contrast to performance benchmarks that require labelled data. We do this from a dual perspective: we directly compare the embeddings using representational similarity measures. Additionally, we evaluate model similarity specifically in terms of their functional impact on RAG systems, i.e. we look at how similar the retrieved results are. Our evaluation focuses on several prominent model families, to analyze similarities both within and across them. We also compare proprietary models (such as those by OpenAI or Cohere) to open-source ones in order to identify the most similar alternatives. Our experiments are carried out on five popular benchmark datasets to determine whether similarities between models are influenced by the choice of data. Our code is available at https://github.com/casparil/embedding-model-similarity.

IR-RAG@SIGIR'24: ACM SIGIR Workshop on Information Retrieval's Role in RAG Systems, July 18, 2024, Washington D.C., USA
* Corresponding author.
Emails: laura.caspari@uni-passau.de (L. Caspari); kanishka.ghoshdastidar@uni-passau.de (K. G. Dastidar); saber.zerhoudi@uni-passau.de (S. Zerhoudi); jelena.mitrovic@uni-passau.de (J. Mitrovic); michael.granitzer@uni-passau.de (M. Granitzer)
ORCID: 0009-0002-6670-3211 (L. Caspari); 0000-0003-4171-0597 (K. G. Dastidar); 0000-0003-2259-0462 (S. Zerhoudi); 0000-0003-3220-8749 (J. Mitrovic); 0000-0003-3566-5507 (M. Granitzer)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073)

2. Related Work

Studies evaluating similarities of neural networks fall into two main categories: the first involves comparing activations of different models generated at any pair of layers for a specific input (representational similarity), while the second compares the model outputs (functional similarity). Raghu et al. [6] and Morcos et al. [7] propose measures building on Canonical Correlation Analysis (CCA) [8], a statistical
technique used to find the linear relationship between two sets of variables by maximizing their correlation. Such comparisons using CCA or variants thereof can be found in several works [9], [10], [11]. Beyond CCA-based measures, other works have also explored computing correlations [12] and the mutual information [13] between neurons across networks. Kornblith et al. [14] propose Centered Kernel Alignment (CKA), which they show improves over several similarity measures in identifying corresponding layers of identical networks with different initializations. A diverse range of functional similarity evaluations has also been explored in the literature. A few examples include model stitching [15], [16], [17], disagreement measures between output classes [18], [19], and quantifying the similarity between the class-wise output probabilities [20]. We point the reader to the survey by Klabunde et al. [4] for a detailed overview of representational and functional similarity measures.

Recently, a few works have also focused specifically on evaluating the similarity of LLMs. While Wu et al. [21] evaluate language models along several perspectives, such as their representational and neuron-level similarities, their evaluation pre-dates the introduction of the recent wave of large-scale models. Freestone and Santu [22] consider similarities of word embeddings, and evaluate whether LLMs differ significantly from classical encoding models in terms of their representations. The works by Klabunde et al. [5] and Brown et al. [23] are more recent, and evaluate the representational similarity of LLMs, with the latter also considering the similarities between models of different sizes in the same model family.

Much of the literature on evaluation of LLM embeddings focuses on their performance on downstream tasks, with benchmarks such as BEIR [24] (for retrieval specifically) and MTEB [3] providing a unified view of embedding quality across metrics and datasets. The metrics used here mostly include typical information retrieval metrics such as precision, recall, and mean reciprocal rank at certain cutoffs. Some works specifically evaluate the retrieval components in a RAG context, where they either use a dataset outside of those included in the benchmarks [25] or where the evaluation encompasses other aspects of the retriever beyond the embedding model being used [26]. Another approach, which does not rely on ground-truth labels, is given by the Retrieval Augmented Generation Assessment (RAGAS) framework, which uses an LLM to determine the ratio of sentences in the retrieved context that are relevant to the answer being generated [27]. To the best of our knowledge, there are no works that evaluate the similarity of embedding models from a retrieval perspective.

Table 1
The datasets used for generating embeddings, with their number of queries and corpus size.

    Dataset Name    Queries    Corpus
    TREC-COVID        50        171k
    NFCorpus         323        3.6k
    FiQA-2018        648         57k
    ArguAna         1406       8.67k
    SciFact          300          5k

3. Methods

We evaluate embedding model similarity using two approaches. The first directly compares the embeddings of text chunks generated by the models. The second approach is specific to the RAG context, where we evaluate the similarity of retrieved results for a given query. These approaches are discussed in detail in the following sections.

3.1. Pair-wise Embedding Similarity

There are several metrics defined in the literature that measure representational similarity [4]. Many of these metrics require the representation spaces of the embeddings to be compared to be aligned and/or the dimensionality of the embeddings across the models to be identical. To avoid these constraints, we pick Centered Kernel Alignment (CKA) [14] with a linear kernel as our similarity measure.

The measure computes similarity between two sets of embeddings in two steps. First, for a set of embeddings, the pair-wise similarity scores between all entries within this set are computed using the kernel function. Thus, row k of the resulting similarity matrix contains entries representing the similarity between embedding k and all other embeddings, including itself. Computing two such embedding similarity matrices for different models with the same number of embeddings then leads to two matrices E and E' of matching dimensions. These are compared directly in the second step with the Hilbert-Schmidt Independence Criterion (HSIC) [28] using the following formula:

$\mathrm{CKA}(E, E') = \frac{\mathrm{HSIC}(E, E')}{\sqrt{\mathrm{HSIC}(E, E)\,\mathrm{HSIC}(E', E')}}$

The resulting similarity scores are bounded in the interval [0, 1], with a score of 1 indicating equivalent representations. CKA assumes that representations are mean-centered.

3.2. Retrieval Similarity

While a pair-wise comparison of embeddings offers insights into the similarities of the representations learned by these models, it does not suffice to quantify the similarities in outcomes when these embedding models are deployed for specific tasks. Therefore, in the context of RAG systems, we consider the similarity of retrieved text chunks for a given query when different embedding models are used. As a first step, for a given dataset, we generate embeddings of queries and document chunks with each of the embedding models. We then retrieve the k most similar embeddings in terms of cosine similarity for a particular query. As these embeddings correspond to specific chunks of text, we derive the sets of retrieved chunks C and C' for a pair of models. To measure the similarity of these sets, we use the Jaccard similarity coefficient as follows:

$\mathrm{Jaccard}(C, C') = \frac{|C \cap C'|}{|C \cup C'|}$

Here, |C ∩ C'| corresponds to the overlap in text chunks, counting how often the two models retrieved the same chunks. Similarly, we can compute the union |C ∪ C'|, which corresponds to all retrieved text chunks, counting chunks present in both sets only once. The resulting score is bounded in the interval [0, 1], with 1 indicating that both models retrieved the same set of text chunks.

While Jaccard similarity computes the degree to which two sets overlap, it ignores the order within the sets.
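With a linear kernel, the HSIC terms in the CKA formula of Section 3.1 reduce to Frobenius norms of cross- and self-covariance matrices, so the whole measure fits in a few lines of NumPy. The sketch below is illustrative only; it is not the authors' implementation (which is available in the linked repository).

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear-kernel CKA between two sets of embeddings.

    X and Y hold one embedding per row for the same n inputs;
    their embedding dimensions may differ. Since CKA assumes
    mean-centered representations, each column is centered first.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # With a linear kernel, HSIC reduces to squared Frobenius
    # norms of the cross- and self-covariance matrices.
    hsic_xy = np.linalg.norm(X.T @ Y, "fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, "fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, "fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```

The score is invariant to isotropic scaling and to orthogonal transformations of either embedding space, which is why no alignment of the two representation spaces is required.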
    Table 2
    We compare a diverse set of open source models from different families as well as proprietary models with varying performance
    on MTEB.
              Model                         Embedding dimension       Max. Tokens      MTEB Average      Open Source
              SFR-Embedding-Mistral                  4096                 32768             67.56              βœ“
              mxbai-embed-large-v1                   1024                  512              64.68              βœ“
              UAE-Large-V1                           1024                  512              64.64              βœ“
              text-embedding-3-large                 3072                 8191              64.59              βœ—
              Cohere embed-english-v3.0              1024                  512              64.47              βœ—
              bge-large-en-v1.5                      1024                  512              64.23              βœ“
              bge-base-en-v1.5                       768                   512              63.55              βœ“
              gte-large                              1024                  512              63.13              βœ“
              gte-base                               768                   512              62.39              βœ“
              text-embedding-3-small                 1536                 8191              62.26              βœ—
              e5-large-v2                            1024                  512              62.25              βœ“
              bge-small-en-v1.5                      384                   512              62.17              βœ“
              e5-base-v2                             768                   512              61.5               βœ“
              gte-small                              384                   512              61.36              βœ“
              e5-small-v2                            384                   512              59.93              βœ“
              gtr-t5-large                           768                   512              58.28              βœ“
              sentence-t5-large                      768                   512              57.06              βœ“
              gtr-t5-base                            768                   512              56.19              βœ“
              sentence-t5-base                       768                   512              55.27              βœ“
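To make the retrieval comparison of Section 3.2 concrete, the sketch below retrieves top-k chunk ids by cosine similarity and computes the Jaccard overlap between two result sets. It is a minimal stand-in under stated assumptions: a brute-force argsort replaces the nearest-neighbor search used in the paper, and the embedding matrices would in practice come from real models, matched by document and chunk ids.

```python
import numpy as np

def top_k_ids(query: np.ndarray, chunk_embs: np.ndarray, k: int) -> set:
    """Ids of the k chunks most cosine-similar to the query embedding."""
    q = query / np.linalg.norm(query)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of every chunk to the query
    return set(np.argsort(-scores)[:k].tolist())

def jaccard(c1: set, c2: set) -> float:
    """Jaccard similarity of two sets of retrieved chunk ids."""
    return len(c1 & c2) / len(c1 | c2)
```

Comparing two models then amounts to calling `top_k_ids` once per model (each with its own query and chunk embeddings) and passing the two id sets to `jaccard`; the ids are comparable across models because each embedding is stored together with its document and chunk id.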



similarity [29], on the other hand, considers the order of            [30], an open source embedding database. For each vector,
common elements, with closer elements having a higher                 we additionally store information about the document and
impact on the score. The measure assigns ranks to common              text chunk ids it encodes to be able to match embeddings
text chunks according to their similarity to the query, i.e.          generated by different models for evaluation.
π‘ŸπΆ (𝑗) = 𝑛 if chunk 𝑗 was the top-𝑛 retrieved result for the             For model selection, we primarily use publicly available
query. Ranks are then compared using:                                 models from the MTEB leaderboard [3]. We do not simply
                                                                      pick the best performing models on the leaderboard; instead,
                                                                      our choices are influenced by several factors. Firstly, we
                                          2
π‘…π‘Žπ‘›π‘˜(π‘ŸπΆ (𝑗), π‘ŸπΆ β€² (𝑗)) = (1+|π‘ŸπΆ (𝑗)βˆ’π‘Ÿ β€² (𝑗)|)(π‘Ÿ
                                        𝐢       𝐢 (𝑗)+π‘Ÿ β€² (𝑗))
                                                       𝐢
                                                                      focus on analyzing similarities within and across model
                                                                      families and pick models belonging to the e5 [31], t5 [32, 33],
  With this, rank similarity for two sets of retrieved text           bge [34], and gte [35] families. Secondly, we recognize
chunks C, C’ is calculated as:                                        that it might be of interest to users to avoid pay-by-token
                                                                      policies of proprietary models by identifying similar open-
                                                                      source alternatives. Therefore, we pick high-performing
             π‘…π‘Žπ‘›π‘˜π‘†π‘–π‘š(𝐢, 𝐢 β€² ) = 𝐻(|𝐢∩𝐢
                                    1
                                                                      proprietary models, two from OpenAI (text-embedding-3-
                                        β€² |)


                                                                      large and -small) [36] and one from Cohere (Cohere embed-
               βˆ‘οΈ
                    π‘…π‘Žπ‘›π‘˜(π‘ŸπΆ (𝑗), π‘ŸπΆ β€² (𝑗))
               π‘—βˆˆ|𝐢∩𝐢 β€² |                                             english-v3.0) [37]. We also compare the mxbai-embed-large-
                                                                      v1 (mxbai) [38] and UAE-Large-V1 (UAE) [39] models, that
                          βˆ‘οΈ€πΎ=|𝐢∩𝐢 β€² | 1                              not only report very similar performances on MTEB, but also
  with 𝐻(|𝐢 ∩ 𝐢 β€² |) = π‘˜=1              π‘˜
                                          denoting the K-th
harmonic number, normalizing the score. Like the other                identical embedding dimensions, model size and memory
measures, rank similarity is bounded in the interval [0, 1]           usage. Finally, we include SFR-Embedding-Mistral (Mistral)
with 1 indicating that all ranks are identical.                       [40] as the best-performing model on the leaderboard at the
                                                                      time of our experiments. A detailed overview of all selected
                                                                      models can be seen in Table 2.
4. Experimental Setup                                                    To compare embedding similarity across models and
                                                                      datasets, we employ different strategies depending on the
The following paragraphs describe our choice of datasets              similarity measure. We apply CKA by retrieving all em-
and models, along with details of the implementation of our           beddings created by a model, matching embeddings using
experiments.                                                          their document and text chunk ids and then computing
   As we focus on the retrieval component of RAG sys-                 their similarity for each of the five datasets. For Jaccard
tems, we select five publicly available datasets from the             and rank similarity, we use sklearn’s NearestNeighbor class
BEIR benchmark [24]. As generating embeddings for large               [41] to determine the the top-π‘˜ retrieval results. We com-
datasets is a time-intensive process, especially for a larger         pute Jaccard and rank scores per dataset, averaging over 25
number of models, we opt for five of the smaller datasets             queries. For the NFCorpus dataset, we calculate retrieval
from the benchmark. This approach allows us to compare                similarity for all possible π‘˜, i.e. using all embeddings gen-
embeddings generated by a variety of models while at the              erated for the dataset. As calculating similarity for each
same time allowing us to evaluate embedding similarity ac-            possible π‘˜ is computationally expensive, we did not repeat
cross datasets. An overview of the datasets is shown in Table         this for the remaining datasets and chose a smaller π‘˜ value
1. For each dataset, we create embeddings by splitting docu-          instead. Furthermore, as only a limited number of results
ments into text chunks such that each chunk contains 256              are to be provided as context to the generative model, ana-
tokens. The embedding vectors are stored with Chroma DB
    1.0                                                                                                                                  gte-large_vs_SFR-Embedding-Mistral                   gte-large_vs_gte-base
                                                                                                                                         gte-large_vs_UAE-Large-V1                            gte-large_vs_gte-small
    0.9                                                                                                                                  gte-large_vs_bge-base-en-v1.5                        gte-large_vs_gtr-t5-base
                                                                                                                                         gte-large_vs_bge-large-en-v1.5                       gte-large_vs_gtr-t5-large
    0.8                                                                                                                                  gte-large_vs_bge-small-en-v1.5                       gte-large_vs_mxbai-embed-large-v1
                                                                                                                                         gte-large_vs_e5-base-v2                              gte-large_vs_sentence-t5-base
    0.7                                                                                                                                  gte-large_vs_e5-large-v2                             gte-large_vs_sentence-t5-large
           1.00 0.81 0.64 0.63 0.64 0.64 0.61 0.63 0.65 0.62 0.71 0.66 0.67 0.68 0.70 0.63 0.67 0.66 0.63 gtr-t5-base                    gte-large_vs_e5-small-v2                             gte-large_vs_text-embedding-3-large
           0.81 1.00 0.67 0.67 0.67 0.66 0.64 0.66 0.68 0.64 0.70 0.74 0.72 0.72 0.74 0.68 0.71 0.69 0.66 gtr-t5-large                   gte-large_vs_embed-english-v3.0                      gte-large_vs_text-embedding-3-small
           0.64 0.67 1.00 0.99 0.98 0.86 0.85 0.93 0.88 0.86 0.73 0.73 0.78 0.80 0.80 0.76 0.79 0.78 0.74 mxbai-embed-large-v1
                                                                                                                                   0.6
           0.63 0.67 0.99 1.00 0.99 0.84 0.82 0.90 0.86 0.83 0.71 0.72 0.76 0.78 0.78 0.75 0.78 0.77 0.73 UAE-Large-V1
           0.64 0.67 0.98 0.99 1.00 0.84 0.81 0.89 0.85 0.82 0.71 0.72 0.76 0.78 0.78 0.76 0.79 0.77 0.74 bge-large-en-v1.5
                                                                                                                                                                                     0.6
           0.64 0.66 0.86 0.84 0.84 1.00 0.93 0.86 0.86 0.85 0.72 0.71 0.74 0.76 0.77 0.72 0.76 0.75 0.75 bge-small-en-v1.5
           0.61 0.64 0.85 0.82 0.81 0.93 1.00 0.90 0.86 0.91 0.72 0.71 0.75 0.77 0.78 0.71 0.76 0.74 0.73 gte-small                0.5
Figure 1: Mean CKA similarity across all five datasets. Models tend to be most similar to models belonging to their own family, though some interesting inter-family patterns are visible as well.

Figure 2: Rank similarity over all k on NFCorpus, comparing gte-large to all other models. Scores are highest and vary most for small k, but then drop quickly before stabilizing for larger k.
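For readers who want to reproduce the pairwise embedding comparison, linear CKA can be sketched as follows. This is a minimal, unbatched sketch; the exact CKA variant used in the paper (kernel choice, batching) may differ:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two embedding matrices.

    X, Y: arrays of shape (n_texts, dim_a) and (n_texts, dim_b), where row i
    of each matrix embeds the same text. Returns a score in [0, 1].
    """
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style formulation: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)
```

CKA is invariant to orthogonal transformations and isotropic scaling of either input, which is what makes it suitable for comparing models with different embedding dimensions.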



lyzing retrieval similarity at low k values, e.g. the top-10, is of most interest. As we are interested in identifying clusters of similar models, we also perform a hierarchical clustering on the heatmap values using Seaborn [42]. The following section describes the results of our evaluation for the different measures.


5. Results

To evaluate how similar the embeddings generated by different models are, we first consider model families, checking whether their pairwise and top-k similarity scores are highest within their family. Subsequently, we identify the open-source models that are most similar to our chosen proprietary models.

5.1. Intra- and Inter-Family Clusters

Comparing embeddings directly with CKA shows high similarity across most of the models, albeit with some variance. These scores allow us to identify certain clusters of models. Figure 1 shows the pair-wise CKA scores of all models averaged across the five datasets. As expected, scores for most models are highest within their own family. This holds true for the gtr-t5, sentence-t5 and text-embedding-3 (OpenAI) models. Although the sentence-t5 and gtr-t5 models are closely related, they do not exhibit significantly higher similarity with each other than with the remaining models.

From an inter-family perspective, we observe high similarity between the bge and gte models. Interestingly, for some models in these two families, the highest similarity scores correspond to inter-family counterparts with matching embedding dimensions rather than to models in the same family. Specifically, gte-small reports the highest similarity to bge-small, and gte-base to bge-base. On the other hand, gte-large shows slightly higher similarity to bge-base than to bge-large, and thus to a model with a lower embedding dimension. Another inter-family cluster is formed by the three models with the highest CKA scores overall, namely UAE, mxbai and bge-large, whose scores suggest almost perfect embedding similarity. In fact, the similarity score of bge-large to these two models is much higher than to the other bge models.

Shifting our attention to top-k retrieval similarity, the clusters vary depending on the value of k. Figure 3 illustrates how Jaccard similarity evolves over k on NFCorpus. The first plot displays Jaccard scores between bge-large and all other models, while the second plot illustrates the scores for gte-large. For extremely low k, we observe peaks for nearly all models, followed by a noticeable drop in similarity; for larger k, the scores necessarily converge to one. Reaffirming our earlier observations with the CKA metric, bge-large demonstrates high retrieval similarity with UAE and mxbai. Its similarity to the remaining models is much lower, with the highest scores for bge-base and bge-small at larger k. However, especially for small k, there is high variance in the similarity scores, with models from other families, e.g. SFR-Embedding-Mistral or gte-large, sometimes achieving higher scores than the bge models. A similar pattern can be observed in the second plot, where the Jaccard similarity for gte-large is highest within its family for larger k, but models like mxbai or bge-base sometimes report higher similarity for small k. Therefore, the clusters we identified through our CKA analysis are only truly reflected in these plots for large values of k. This suggests that in real-world use cases, where the top-k results are crucial, such representational similarity measures might not provide the full picture. The plots for other model families provide nearly identical insights to those in the second plot of Figure 3, and we thus omit them for the sake of brevity.

For rank similarity, scores peak for small k and then quickly drop until they approach a low, stable level for larger k, as shown in Figure 2 for gte-large. Once again, the bge/UAE/mxbai inter-family cluster shows the highest similarity. In contrast to Jaccard similarity, the clusters observed for CKA do not always appear for rank similarity. As can be seen in Figure 2, the model with the highest rank similarity to gte-large is mxbai, rather than
Figure 3: Jaccard similarity over all k on NFCorpus, comparing bge-large (a) and gte-large (b) to all other models. While bge-large shows high similarity to UAE-Large-V1 and mxbai-embed-large-v1, the scores for gte-large are clustered much more closely. Jaccard similarity appears most unstable for small values of k, which would commonly be chosen for retrieval tasks.
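The Jaccard curves above compare, query by query, the sets of top-k chunks retrieved by two models. A minimal sketch of this computation (function names are ours, not the paper's):

```python
def jaccard_at_k(run_a, run_b, k):
    """Jaccard similarity of two models' top-k retrieved chunk ids for a
    single query: |intersection| / |union|."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

def mean_jaccard_at_k(runs_a, runs_b, k):
    """Average the per-query scores, as done for the heatmaps.
    runs_a, runs_b: lists of ranked chunk-id lists, one per query."""
    scores = [jaccard_at_k(a, b, k) for a, b in zip(runs_a, runs_b)]
    return sum(scores) / len(scores)
```

Note that, unlike a rank-aware measure, this treats the top-k as an unordered set, so two models retrieving the same chunks in reversed order still score 1.0.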



Figure 4: Jaccard (a) and rank similarity (b) for the top-10 retrieved text chunks, averaged over 25 queries on NFCorpus. The clusters vary slightly depending on the measure, as do the scores. Models tend to be most similar to models from their own family. However, some inter-family clusters are visible as well.
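The exact rank-similarity formula used in the paper is not reproduced in this excerpt. As an illustration of a top-weighted, rank-aware overlap measure in the same spirit, a truncated rank-biased overlap (RBO) can be sketched as follows; the persistence parameter p and the truncation at depth k are our assumptions:

```python
def rbo_at_k(run_a, run_b, k, p=0.9):
    """Truncated rank-biased overlap of two rankings: set agreement at each
    depth d <= k, geometrically discounted so that agreement among the top
    ranks dominates. Normalized so identical rankings score 1.0."""
    score = 0.0
    for d in range(1, k + 1):
        agreement = len(set(run_a[:d]) & set(run_b[:d])) / d
        score += (p ** (d - 1)) * agreement
    # Sum of weights p^0 + ... + p^(k-1) = (1 - p^k) / (1 - p)
    return score * (1 - p) / (1 - p ** k)
```

Unlike plain Jaccard at a fixed k, such a measure rewards agreement at the top of the ranking, which matches the observation that rank-based scores peak for small k.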



another gte model. Even so, the previously observed clusters also tend to appear for rank similarity, though they vary more depending on the models and the dataset. Generally, scores for nearly all models are rather small for larger k, indicating low rank similarity. For small k, the results vary more and the differences between individual models are more pronounced.

As retrieval similarity at small k is of most interest from a practical perspective, we take a closer look at top-10 Jaccard similarity. The heatmaps in Figures 4-6 show the top-10 Jaccard similarity between models across datasets. A striking insight here is that even the most similar models only report a Jaccard similarity of just over 0.6, with the majority below 0.5. This is of great relevance to practitioners, as it implies that even embeddings from models reporting high representational similarity scores may yield little overlap in retrieved text chunks. As before, the UAE/mxbai/bge-large cluster is the most prominent one, with the highest scores. Intra-family scores tend to be the highest for most models, e.g. for the t5 and OpenAI families. Depending on the
             e5-small-v2 0.19 0.19 0.19 0.16 0.13 0.24 0.24 1.00 0.19 0.17 0.18 0.16 0.13 0.14 0.19 0.10 0.10 0.22 0.21                                                                                                                                                                                                                                                                   e5-small-v2 0.50 0.41 0.49 0.41 0.43 0.56 0.54 1.00 0.51 0.40 0.38 0.37 0.46 0.50 0.41 0.43 0.45 0.54 0.52
      embed-english-v3.0 0.34 0.35 0.37 0.35 0.26 0.34 0.30 0.19 1.00 0.30 0.29 0.26 0.21 0.23 0.37 0.12 0.14 0.33 0.36                                                                                                                                                                                                                                                  0.6      embed-english-v3.0 0.44 0.41 0.46 0.42 0.41 0.49 0.49 0.51 1.00 0.39 0.35 0.36 0.45 0.43 0.41 0.38 0.37 0.48 0.49                                                                                                                                                                                                                                                  0.7
                gte-base 0.32 0.35 0.44 0.35 0.25 0.27 0.22 0.17 0.30 1.00 0.43 0.38 0.16 0.21 0.37 0.12 0.14 0.33 0.33                                                                                                                                                                                                                                                                     gte-base 0.43 0.53 0.59 0.50 0.50 0.44 0.39 0.40 0.39 1.00 0.55 0.56 0.43 0.43 0.56 0.48 0.43 0.53 0.50
                gte-large 0.26 0.36 0.30 0.31 0.21 0.26 0.24 0.18 0.29 0.43 1.00 0.38 0.17 0.20 0.41 0.10 0.12 0.29 0.28                                                                                                                                                                                                                                                                    gte-large 0.42 0.56 0.50 0.50 0.48 0.40 0.39 0.38 0.35 0.55 1.00 0.54 0.40 0.40 0.57 0.44 0.45 0.48 0.47
               gte-small 0.24 0.28 0.29 0.25 0.26 0.25 0.22 0.16 0.26 0.38 0.38 1.00 0.17 0.19 0.32 0.09 0.11 0.25 0.23                                                                                                                                                                                                                                                                     gte-small 0.36 0.44 0.46 0.42 0.63 0.37 0.34 0.37 0.36 0.56 0.54 1.00 0.37 0.37 0.47 0.46 0.40 0.44 0.44                                                                                                                                                                                                                                                 0.6
              gtr-t5-base 0.22 0.18 0.18 0.18 0.13 0.18 0.15 0.13 0.21 0.16 0.17 0.17 1.00 0.35 0.19 0.11 0.12 0.24 0.22                                                                                                                                                                                                                                                 0.4              gtr-t5-base 0.49 0.42 0.46 0.41 0.41 0.45 0.48 0.46 0.45 0.43 0.40 0.37 1.00 0.63 0.43 0.47 0.49 0.51 0.50
              gtr-t5-large 0.28 0.21 0.20 0.20 0.16 0.23 0.17 0.14 0.23 0.21 0.20 0.19 0.35 1.00 0.23 0.10 0.14 0.28 0.25                                                                                                                                                                                                                                                                 gtr-t5-large 0.52 0.43 0.49 0.43 0.42 0.47 0.50 0.50 0.43 0.43 0.40 0.37 0.63 1.00 0.44 0.47 0.51 0.53 0.53
   mxbai-embed-large-v1 0.30 0.76 0.35 0.59 0.25 0.27 0.25 0.19 0.37 0.37 0.41 0.32 0.19 0.23 1.00 0.12 0.14 0.34 0.30                                                                                                                                                                                                                                                          mxbai-embed-large-v1 0.44 0.89 0.56 0.80 0.45 0.43 0.40 0.41 0.41 0.56 0.57 0.47 0.43 0.44 1.00 0.44 0.42 0.50 0.47                                                                                                                                                                                                                                                  0.5

        sentence-t5-base 0.13 0.12 0.13 0.10 0.10 0.12 0.10 0.10 0.12 0.12 0.10 0.09 0.11 0.10 0.12 1.00 0.25 0.14 0.13                                                                                                                                                                                                                                                             sentence-t5-base 0.50 0.41 0.51 0.41 0.51 0.43 0.42 0.43 0.38 0.48 0.44 0.46 0.47 0.47 0.44 1.00 0.61 0.54 0.51
        sentence-t5-large 0.16 0.14 0.12 0.13 0.10 0.14 0.13 0.10 0.14 0.14 0.12 0.11 0.12 0.14 0.14 0.25 1.00 0.16 0.17                                                                                                                                                                                                                                                 0.2        sentence-t5-large 0.49 0.39 0.51 0.39 0.43 0.43 0.47 0.45 0.37 0.43 0.45 0.40 0.49 0.51 0.42 0.61 1.00 0.55 0.53
   text-embedding-3-large 0.47 0.31 0.32 0.30 0.23 0.29 0.23 0.22 0.33 0.33 0.29 0.25 0.24 0.28 0.34 0.14 0.16 1.00 0.44                                                                                                                                                                                                                                                       text-embedding-3-large 0.61 0.50 0.56 0.47 0.46 0.54 0.53 0.54 0.48 0.53 0.48 0.44 0.51 0.53 0.50 0.54 0.55 1.00 0.66                                                                                                                                                                                                                                                 0.4

  text-embedding-3-small 0.38 0.29 0.34 0.30 0.20 0.32 0.25 0.21 0.36 0.33 0.28 0.23 0.22 0.25 0.30 0.13 0.17 0.44 1.00                                                                                                                                                                                                                                                        text-embedding-3-small 0.55 0.46 0.53 0.44 0.45 0.53 0.54 0.52 0.49 0.50 0.47 0.44 0.50 0.53 0.47 0.51 0.53 0.66 1.00
                         SFR-Embedding-Mistral
                                                 UAE-Large-V1




                                                                                                                                                                                                                                                                                                                                                                                                     SFR-Embedding-Mistral
                                                                bge-base-en-v1.5
                                                                                   bge-large-en-v1.5
                                                                                                       bge-small-en-v1.5




                                                                                                                                                                                                                                                                                                                                                                                                                             UAE-Large-V1
                                                                                                                           e5-base-v2
                                                                                                                                        e5-large-v2
                                                                                                                                                      e5-small-v2
                                                                                                                                                                    embed-english-v3.0
                                                                                                                                                                                         gte-base
                                                                                                                                                                                                    gte-large
                                                                                                                                                                                                                gte-small




                                                                                                                                                                                                                                                         mxbai-embed-large-v1
                                                                                                                                                                                                                            gtr-t5-base
                                                                                                                                                                                                                                          gtr-t5-large


                                                                                                                                                                                                                                                                                sentence-t5-base
                                                                                                                                                                                                                                                                                                   sentence-t5-large
                                                                                                                                                                                                                                                                                                                       text-embedding-3-large
                                                                                                                                                                                                                                                                                                                                                text-embedding-3-small




                                                                                                                                                                                                                                                                                                                                                                                                                                            bge-base-en-v1.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                               bge-large-en-v1.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   bge-small-en-v1.5
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       e5-base-v2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    e5-large-v2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  e5-small-v2
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                embed-english-v3.0
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     gte-base
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                gte-large
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            gte-small




                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     mxbai-embed-large-v1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        gtr-t5-base
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      gtr-t5-large


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            sentence-t5-base
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               sentence-t5-large
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   text-embedding-3-large
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            text-embedding-3-small
                                                                                                                                                                    (a)                                                                                                                                                                                                                                                                                                                                                                         (b)

Figure 5: Jaccard similarity for the top-10 retrieved text chunks, averaged over 25 queries, on SciFact (a) and ArguAna (b). The UAE and mxbai models show high levels of similarity, along with bge-large. The remaining models tend to show the highest similarity within their own family, with the exception of the bge/gte inter-family cluster.


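For concreteness, the top-k Jaccard overlap reported in Figure 5 can be sketched as below. The chunk IDs, model rankings, and the rank-aware variant (average overlap over the top-1..top-k prefixes) are illustrative assumptions for this sketch; the paper's exact rank similarity measure is not reproduced here.

```python
# Sketch: retrieval similarity between two embedding models for one query.
# Chunk IDs and rankings are hypothetical, for illustration only.

def jaccard_top_k(ids_a, ids_b, k=10):
    """Jaccard similarity of the top-k retrieved chunk IDs of two models."""
    a, b = set(ids_a[:k]), set(ids_b[:k])
    return len(a & b) / len(a | b)

def rank_similarity(ids_a, ids_b, k=10):
    """One possible rank-aware variant: average overlap of the top-1..top-k
    prefixes, so agreement near the top of the ranking counts more."""
    overlaps = []
    for depth in range(1, k + 1):
        a, b = set(ids_a[:depth]), set(ids_b[:depth])
        overlaps.append(len(a & b) / depth)
    return sum(overlaps) / k

# Illustrative top-5 retrieval results (chunk IDs) for one query.
model_a = ["c3", "c7", "c1", "c9", "c4"]
model_b = ["c3", "c1", "c8", "c7", "c5"]

print(jaccard_top_k(model_a, model_b, k=5))   # 3 shared IDs out of 7 unique
print(rank_similarity(model_a, model_b, k=5))
```

Averaging such per-query scores over all queries in a dataset yields one cell of the matrices in Figure 5; the prefix-averaged variant rewards models that agree on the highest-ranked chunks, which matters most when only a few chunks are passed to the LLM.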

dataset, this also applies to gte and e5 models, although Jaccard similarity to models from other families is sometimes higher. We also note that for the two larger datasets, FiQA-2018 and TREC-COVID, the similarity scores are generally substantially lower, as can be seen in Figure 6. For the smaller datasets, Jaccard similarity is generally higher, though results differ depending on the data (see Figures 4 and 5). Similar observations can be made for rank similarity, although the appearance of family clusters is more dependent on the dataset. Larger datasets also lead to lower scores. These results illustrate that while family clusters can still be perceived at small k, they are not as prominent as they are for larger k. Furthermore, the top-10 retrieved results differ substantially for most models and datasets, and their similarity may depend on the dataset itself.

5.2. Open Source Alternatives to Proprietary Models

We explicitly included proprietary models in our analysis to check which open source models offer the best alternative, which in our case means the most similar one. The CKA scores in Figure 1 indicate that embeddings generated by OpenAI's models (text-embedding-3-large/-small) are highly similar to those generated by Mistral, while the Cohere model (embed-

retrieval similarity between Mistral and OpenAI models is only low to moderate. On smaller datasets, the highest Jaccard similarity to text-embedding-3-large only reaches about 0.6 (see Figure 5), while on TREC-COVID, the largest dataset, Jaccard similarity goes down to merely 0.18 (see Figure 6). For Cohere's model, the most similar model in terms of top-10 Jaccard similarity differs for each dataset, with the highest score of 0.51 occurring on ArguAna, as shown in Figure 5. For all proprietary models, even the best retrieval similarity at top-10 still suggests that the embeddings that would be presented to an LLM can differ notably. Once again, we also observe dataset-dependent variance in scores, with lower retrieval similarity on larger datasets.

6. Discussion

While a pair-wise comparison of embeddings using CKA shows intra- and inter-family model clusters, retrieval similarity over different k offers a more nuanced picture. Especially for small k, which is of most interest from a practical perspective, retrieval similarity varies. When comparing the top-10 retrieved text chunks, the low Jaccard similarity scores indicate little overlap in retrieved chunks, even when CKA scores are high. Especially for the two larger datasets, FiQA-2018 and TREC-COVID, these scores are extremely
english-v3.0) demonstrates high similarity to e5-large-v2.                                                                                                                                                                                                                                                                                                                          low. As RAG systems usually operate on millions of em-
   These observations do not entirely extend to retrieval sim-                                                                                                                                                                                                                                                                                                                      beddings, our datasets are comparatively small. Therefore,
ilarity, especially for Cohere. While Mistral is still the most                                                                                                                                                                                                                                                                                                                     should a general trend of larger datasets leading to lower
similar model to OpenAI’s for larger π‘˜ across all datasets,                                                                                                                                                                                                                                                                                                                         retrieval similarity exist, text chunks retrieved by differ-
there is no consistently most similar model for Cohere.                                                                                                                                                                                                                                                                                                                             ent models in a regular use case might be nearly distinct
Rather, the model varies depending on the dataset and mea-                                                                                                                                                                                                                                                                                                                          for small π‘˜. Overall, our results suggest that even though
sure - Jaccard and rank similarity - sometimes showing high-                                                                                                                                                                                                                                                                                                                        embeddings seem rather similar when compared directly,
est similarity to e5-large-v2, but sometimes also to other                                                                                                                                                                                                                                                                                                                          retrieval performance can still vary substantially, is most
models like Mistral. Taking a closer look at top-10 similar-                                                                                                                                                                                                                                                                                                                        unstable for π‘˜ values that are commonly used in RAG sys-
ity, Mistral still largely exhibits the highest similarity to the                                                                                                                                                                                                                                                                                                                   tems and also dataset-dependent. Retrieved chunks at small
OpenAI models, especially to text-embedding-3-large. For                                                                                                                                                                                                                                                                                                                            π‘˜ show the least overlap, leading to high differences in data
text-embedding-3-small, scores on all datasets are rather                                                                                                                                                                                                                                                                                                                           that would be presented to an LLM as additional context.
close to those of other models. In absolute terms, however,
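The pair-wise embedding comparison above relies on Centered Kernel Alignment. As a minimal sketch of how such a score can be computed (this is our illustration, not the authors' implementation; the function name and the choice of the linear-kernel CKA variant are our assumptions), two embedding matrices of the same texts can be compared as follows:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two embedding matrices of the same n texts.

    X: (n, d1) and Y: (n, d2) arrays; the two models may have
    different embedding dimensions d1 and d2.
    Returns a value in [0, 1]; 1 means identical (up to rotation/scale).
    """
    # Center each feature dimension
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style numerator and normalizers
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either embedding space, which makes it suitable for comparing models with differently oriented or differently sized embedding spaces.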
[Figure 6 heatmaps: pairwise Jaccard similarity of the top-10 retrieved chunks for 18 embedding models; panel (a) FiQA-2018, panel (b) TREC-COVID. The highest off-diagonal values occur within the bge/UAE/mxbai cluster (e.g., UAE-Large-V1 vs. mxbai-embed-large-v1: 0.64 and 0.71) and between the two OpenAI models (0.31 and 0.29).]

Figure 6: Jaccard similarity for the top-10 retrieved text chunks, averaged over 25 queries, on FiQA-2018 (a) and TREC-COVID (b). Most models retrieve largely distinct text chunks: only the bge/UAE/mxbai cluster still shows a notable level of similarity, while the scores of the remaining clusters indicate only moderate to low similarity.
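For readers reproducing this kind of comparison, the top-k Jaccard overlap underlying Figure 6 can be computed along the following lines. This is an illustrative sketch, not the authors' code: the cosine-based `top_k_ids` helper and all names are our own, and the paper's separate rank-similarity measure is not reproduced here.

```python
import numpy as np

def top_k_ids(query_emb, chunk_embs, k=10):
    """Indices of the k chunks most similar to the query (cosine similarity)."""
    sims = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-12
    )
    return np.argsort(-sims)[:k].tolist()

def jaccard_similarity(a, b):
    """Overlap between two top-k result sets, ignoring rank."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```

Embedding the same corpus and queries with two models, computing `jaccard_similarity` between their `top_k_ids` per query, and averaging over the 25 queries would yield one cell of a heatmap like Figure 6.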



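The pairwise embedding comparison in this work uses Centered Kernel Alignment. For reference, linear CKA in the formulation popularized by Kornblith et al. can be sketched as follows; this is a minimal illustration under that standard formulation, not the authors' implementation:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two embedding matrices of shape (n_samples, dim).

    The two models may use different embedding dimensions; only the number
    of rows (the embedded texts) must match. The score lies in [0, 1] and is
    invariant to orthogonal transformations and isotropic scaling.
    """
    X = X - X.mean(axis=0, keepdims=True)  # column-center each representation
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # HSIC with linear kernels
    return hsic / (
        np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    )
```

Feeding the same set of text chunks through two embedding models and comparing the resulting matrices this way yields one entry of a pairwise CKA similarity matrix.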
Our analysis demonstrates that although models tend to be most similar to models from their own family, inter-family clusters exist. The most prominent of these clusters is formed by the models bge-large-en-v1.5, UAE-Large-V1 and mxbai-embed-large-v1, which demonstrate high similarity even for retrieval at low k. Nevertheless, the high variance in retrieval similarity of the remaining clusters at small k means that while the identified clusters may provide some orientation when choosing an embedding model, the choice remains a non-trivial task. Identifying suitable alternatives to proprietary models is likewise not simple. While we were able to identify SFR-Embedding-Mistral as the model most similar to OpenAI's embedding models, Jaccard similarity at top-10 on the larger datasets shows a low overlap in retrieved text chunks. Furthermore, for Cohere's embedding model we were unable to find a single most similar model, as the closest model varied across datasets for small k values.


7. Conclusion

In this paper we evaluated the similarity of embedding models on different datasets. Given the large number of available models, identifying clusters or families of models with similar embeddings can simplify the model selection process. While previous work on LLM similarity exists, to the best of the authors' knowledge it so far lacks a clear focus on embedding models specifically in the context of RAG. We therefore analyzed the similarity of embeddings generated by 19 different models, using CKA for pairwise comparison as well as Jaccard and rank similarity to compare retrieval behavior at top-k, across five datasets. Comparing embeddings with CKA generally showed intra- and inter-family clusters across datasets. These clusters also appeared when evaluating top-k retrieval similarity with large k values. However, scores for low k values, which would commonly be chosen in RAG systems, show high variance and much lower similarity, especially on larger datasets. Although we were able to identify some model clusters, our results suggest that choosing the optimal model remains a non-trivial task that requires careful consideration.

Still, we argue that a better understanding of how similarly different embedding models behave is an important research topic that requires further attention. There is a plethora of real-world scenarios where RAG systems can potentially be deployed. One such scenario, for example, is retrieving relevant web content in response to a query. As large corpora of such data are available in the form of Web ARChive (WARC) files, evaluating embedding model similarity on such large, uncleaned datasets would perhaps lead to a better estimate of model similarity for a realistic RAG use case. Additionally, as documents often need to be chunked into smaller parts to fit into the models, the effect of chunking strategies such as token-based or semantic chunking on embedding similarity could be explored. Furthermore, our evaluation focused on a small sample of similarity measures, each with its own definition of what makes models similar. Introducing more measures with different perspectives could improve our understanding of which factors influence model similarity. Finally, our focus was on identifying clusters or families of models, which for our initial experiments led us to choose families of embedding models with varying performance on MTEB. With the frequent appearance of new models on the leaderboard and the focus on good MTEB performance, it would be of interest to compare the best-performing models on MTEB and check whether their relative difference in performance correlates with how similar these models are.


Acknowledgments

This work has received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).