<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Laura</forename><surname>Caspari</surname></persName>
							<email>laura.caspari@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kanishka</forename><surname>Ghosh Dastidar</surname></persName>
							<email>kanishka.ghoshdastidar@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Saber</forename><surname>Zerhoudi</surname></persName>
							<email>saber.zerhoudi@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jelena</forename><surname>Mitrovic</surname></persName>
							<email>jelena.mitrovic@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michael</forename><surname>Granitzer</surname></persName>
							<email>michael.granitzer@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">ACM SIGIR Workshop on Information Retrieval&apos;s Role in RAG Systems</orgName>
								<address>
									<addrLine>July 18, 2024</addrLine>
									<region>Washington, D.C.</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BD400077D5A4E5A3AE41226FC76BA8C1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large language model</term>
					<term>Retrieval-augmented generation</term>
					<term>Model similarity</term>
					<!-- ORCID iDs: 0009-0002-6670-3211 (L. Caspari), 0000-0003-4171-0597 (K. Ghosh Dastidar), 0000-0003-2259-0462 (S. Zerhoudi), 0000-0003-3220-8749 (J. Mitrovic), 0000-0003-3566-5507 (M. Granitzer) -->
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: We use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular BEIR benchmark. Through our experiments we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-𝑘 retrieval similarity reveals high variance at low 𝑘 values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Motivation</head><p>Retrieval-Augmented Generation (RAG) is an emerging paradigm that helps mitigate the problems of factual hallucination <ref type="bibr" target="#b0">[1]</ref> and outdated training data <ref type="bibr" target="#b1">[2]</ref> of large language models (LLMs) by providing these models with access to an external, non-parametric knowledge source (e.g. a document corpus). Central to the functioning of RAG frameworks is the retrieval step, wherein a small subset of candidate documents, specific to the input query or prompt, is retrieved from the document corpus. This retrieval process, known as dense retrieval, hinges on text embeddings. Typically, the generation of these embeddings is assigned to an LLM, for which there are several options due to the rapid evolution of the field. Consequently, selecting the most suitable embedding model from an array of available choices emerges as a critical aspect in the development of RAG systems. The information to guide this choice is currently limited primarily to architectural details (which are also on occasion scarce due to the prevalence of closed models) and performance benchmarks such as the Massive Text Embedding Benchmark (MTEB) <ref type="bibr" target="#b2">[3]</ref>.</p><p>We posit that an analysis of the similarity of the embeddings generated by these models would significantly aid this model selection process. Given the large number of candidates and the ever increasing scale of the models, a from-scratch empirical evaluation of the embedding quality of these LLMs on a particular task can incur significant costs. This challenge becomes especially pronounced when dealing with large-scale corpora comprising potentially millions of documents. 
While the relative performance scores of these models on benchmark datasets offer the simplified perspective of comparing a single scalar value on an array of downstream tasks, such a view of model similarity might overlook the nuances of the relative behaviour of the models <ref type="bibr" target="#b3">[4]</ref>. As an example, the absolute difference in precision@k between two retrieval systems only provides a weak indication of the overlap of their retrieved results. We argue that identifying clusters of models with similar behaviour would allow practitioners to construct smaller, yet diverse candidate pools of models to evaluate. Beyond model selection, as highlighted by Klabunde et al. <ref type="bibr" target="#b4">[5]</ref>, such an analysis also facilitates the identification of common factors contributing to strong performance, easier model ensembling, and detection of potential instances of unauthorized model reuse.</p><p>In this paper, we analyze different LLMs in terms of the similarities of the embeddings they generate. Our similarity analysis serves as an unsupervised evaluation framework for these embedding models, in contrast to performance benchmarks that require labelled data. We do this from a dual perspective: we directly compare the embeddings using representational similarity measures. Additionally, we evaluate model similarity specifically in terms of their functional impact on RAG systems, i.e. we look at how similar the retrieved results are. Our evaluation focuses on several prominent model families, to analyze similarities both within and across them. We also compare proprietary models (such as those by OpenAI or <ref type="bibr">Cohere</ref>) to open-source ones in order to identify the most similar alternatives. Our experiments are carried out on five popular benchmark datasets to determine if similarities between models are influenced by the choice of data. 
Our code is available at https://github.com/casparil/embedding-model-similarity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>The datasets used for generating embeddings with their number of queries and corpus size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Canonical Correlation Analysis (CCA) is a statistical technique used to find the linear relationship between two sets of variables by maximizing their correlation. Such comparisons using CCA or variants thereof can be found in several works <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b9">[10]</ref>, <ref type="bibr" target="#b10">[11]</ref>. Beyond CCA-based measures, other works have also explored computing correlations <ref type="bibr" target="#b11">[12]</ref> and the mutual information <ref type="bibr" target="#b12">[13]</ref> between neurons across networks. Kornblith et al. <ref type="bibr" target="#b13">[14]</ref> propose Centered Kernel Alignment (CKA), which they show improves over several similarity measures in identifying corresponding layers of identical networks with different initializations. A diverse range of functional similarity evaluations have also been explored in the literature. A few examples include model stitching <ref type="bibr" target="#b14">[15]</ref>, <ref type="bibr" target="#b15">[16]</ref>, <ref type="bibr" target="#b16">[17]</ref>, disagreement measures between output classes <ref type="bibr" target="#b17">[18]</ref>, <ref type="bibr" target="#b18">[19]</ref>, and quantifying the similarity between the class-wise output probabilities <ref type="bibr" target="#b19">[20]</ref>. We point the reader to the survey by Klabunde et al. <ref type="bibr" target="#b3">[4]</ref> for a detailed overview of representational and functional similarity measures.</p><p>Recently, a few works have also focused on specifically evaluating the similarity of LLMs. While Wu et al. <ref type="bibr" target="#b20">[21]</ref> evaluate language models along several perspectives, such as their representational and neuron-level similarities, their evaluation pre-dates the introduction of the recent wave of large-scale models. 
Freestone and Santu <ref type="bibr" target="#b21">[22]</ref> consider similarities of word embeddings, and evaluate whether LLMs differ significantly from classical encoding models in terms of their representations. The works by Klabunde et al. <ref type="bibr" target="#b4">[5]</ref> and Brown et al. <ref type="bibr" target="#b22">[23]</ref> are more recent and evaluate the representational similarity of LLMs, with the latter also considering the similarities between models of different sizes in the same model family.</p><p>Much of the literature on the evaluation of LLM embeddings focuses on their performance on downstream tasks, with benchmarks such as BEIR <ref type="bibr" target="#b23">[24]</ref> (for retrieval specifically) and MTEB <ref type="bibr" target="#b2">[3]</ref> providing a unified view of embedding quality across metrics and datasets. The metrics used here mostly include typical information retrieval metrics such as precision, recall, and mean reciprocal rank at certain cutoffs. Some works specifically evaluate the retrieval components in a RAG context, where they either use a dataset outside of those included in the benchmarks <ref type="bibr" target="#b24">[25]</ref> or where the evaluation encompasses other aspects of the retriever beyond the embedding model being used <ref type="bibr" target="#b25">[26]</ref>. Another approach that does not rely on ground-truth labels is given by the Retrieval Augmented Generation Assessment (RAGAS) framework, which uses an LLM to determine the ratio of sentences in the retrieved context that are relevant to the answer being generated <ref type="bibr" target="#b26">[27]</ref>. To the best of our knowledge, there are no works that evaluate the similarity of embedding models from a retrieval perspective.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>We evaluate embedding model similarity using two approaches. The first directly compares the embeddings of text chunks generated by the models. The second approach is specific to the RAG context, where we evaluate the similarity of retrieved results for a given query. These approaches are discussed in detail in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Pair-wise Embedding Similarity</head><p>There are several metrics defined in the literature that measure representational similarity <ref type="bibr" target="#b3">[4]</ref>. Many of these metrics require the representation spaces of the models being compared to be aligned and/or the dimensionality of the embeddings to be identical across models. To avoid these constraints, we pick Centered Kernel Alignment (CKA) <ref type="bibr" target="#b13">[14]</ref> with a linear kernel as our similarity measure.</p><p>The measure computes the similarity between two sets of embeddings in two steps. First, for a set of embeddings, the pair-wise similarity scores between all entries within this set are computed using the kernel function. Thus, row 𝑘 of the resulting similarity matrix contains entries representing the similarity between embedding 𝑘 and all other embeddings, including itself. Computing two such embedding similarity matrices for different models with the same number of embeddings then leads to two matrices 𝐸 and 𝐸′ of matching dimensions. These are compared directly in the second step with the Hilbert-Schmidt Independence Criterion (HSIC) <ref type="bibr" target="#b27">[28]</ref> using the following formula:</p><formula xml:id="formula_0">𝐶𝐾𝐴(𝐸, 𝐸′) = 𝐻𝑆𝐼𝐶(𝐸, 𝐸′) / √(𝐻𝑆𝐼𝐶(𝐸, 𝐸) · 𝐻𝑆𝐼𝐶(𝐸′, 𝐸′))</formula><p>The resulting similarity scores are bounded in the interval [0, 1], with a score of 1 indicating equivalent representations. CKA assumes that representations are mean-centered.</p></div>
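With a linear kernel, the two-step computation above collapses into a closed form: HSIC reduces to squared Frobenius norms of the (cross-)covariance matrices of the mean-centered embeddings. The following is a minimal sketch assuming NumPy; the function name `linear_cka` is ours and not taken from the paper's released code:

```python
import numpy as np

def linear_cka(E: np.ndarray, E_prime: np.ndarray) -> float:
    """Linear CKA between two embedding matrices of shape (n_chunks, dim).

    The two models may use different embedding dimensions; only the
    number of embeddings (rows) must match, with row i of both matrices
    encoding the same text chunk.
    """
    # CKA assumes mean-centered representations.
    X = E - E.mean(axis=0)
    Y = E_prime - E_prime.mean(axis=0)
    # With a linear kernel, HSIC reduces to a squared Frobenius norm.
    hsic_xy = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, ord="fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, ord="fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```

Comparing a model's embeddings with themselves yields a score of exactly 1, while scores between unrelated embeddings fall toward 0.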
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Retrieval Similarity</head><p>While a pair-wise comparison of embeddings offers insights into the similarities of the representations learned by these models, it does not suffice to quantify the similarities in outcomes when these embedding models are deployed for specific tasks. Therefore, in the context of RAG systems, we consider the similarity of the retrieved text chunks for a given query when different embedding models are used. As a first step, for a given dataset, we generate embeddings of queries and document chunks with each of the embedding models. We then retrieve the 𝑘 most similar embeddings in terms of cosine similarity for a particular query. As these embeddings correspond to specific chunks of text, we derive the sets of retrieved chunks 𝐶 and 𝐶′ for a pair of models. To measure the similarity of these sets, we use the Jaccard similarity coefficient as follows:</p><formula xml:id="formula_1">𝐽𝑎𝑐𝑐𝑎𝑟𝑑(𝐶, 𝐶′) = |𝐶 ∩ 𝐶′| / |𝐶 ∪ 𝐶′|</formula><p>Here, the intersection |𝐶 ∩ 𝐶′| counts the chunks retrieved by both models, while the union |𝐶 ∪ 𝐶′| corresponds to all retrieved text chunks, counting chunks present in both sets only once. The resulting score is bounded in the interval [0, 1], with 1 indicating that both models retrieved the same set of text chunks.</p><p>While Jaccard similarity computes the degree to which two sets overlap, it ignores the order of their elements. Rank similarity <ref type="bibr" target="#b28">[29]</ref>, on the other hand, considers the order of common elements, with closer elements having a higher impact on the score. The measure assigns ranks to common text chunks according to their similarity to the query, i.e. 𝑟𝐶(𝑗) = 𝑛 if chunk 𝑗 was the top-𝑛 retrieved result for the query. 
Ranks are then compared using:</p><formula xml:id="formula_2">𝑅𝑎𝑛𝑘(𝑟𝐶(𝑗), 𝑟𝐶′(𝑗)) = 2 / ((1 + |𝑟𝐶(𝑗) − 𝑟𝐶′(𝑗)|) · (𝑟𝐶(𝑗) + 𝑟𝐶′(𝑗)))</formula><p>With this, the rank similarity for two sets of retrieved text chunks 𝐶, 𝐶′ is calculated as:</p><formula xml:id="formula_3">𝑅𝑎𝑛𝑘𝑆𝑖𝑚(𝐶, 𝐶′) = (1 / 𝐻(|𝐶 ∩ 𝐶′|)) ∑_{𝑗 ∈ 𝐶 ∩ 𝐶′} 𝑅𝑎𝑛𝑘(𝑟𝐶(𝑗), 𝑟𝐶′(𝑗)), with 𝐻(𝐾) = ∑_{𝑘=1}^{𝐾} 1/𝑘</formula><p>Here, 𝐻(𝐾) denotes the 𝐾-th harmonic number, which normalizes the score. Like the other measures, rank similarity is bounded in the interval [0, 1], with 1 indicating that all ranks are identical.</p></div>
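Both retrieval measures are straightforward to compute from two ranked result lists. The sketch below uses our own helper names (not the paper's code); chunks are identified by their ids, and rank 1 denotes the top retrieved result:

```python
def jaccard_similarity(C: set, C_prime: set) -> float:
    """Jaccard coefficient between two sets of retrieved chunk ids."""
    if not C and not C_prime:
        return 1.0
    return len(C & C_prime) / len(C | C_prime)

def rank_similarity(ranks_c: dict, ranks_c_prime: dict) -> float:
    """Rank similarity between two retrievals.

    Each argument maps chunk id -> rank (1 = most similar to the query).
    Only chunks retrieved by both models contribute, closer ranks weigh
    more, and the harmonic number of the overlap size normalizes the sum.
    """
    common = ranks_c.keys() & ranks_c_prime.keys()
    if not common:
        return 0.0
    total = sum(
        2 / ((1 + abs(ranks_c[j] - ranks_c_prime[j]))
             * (ranks_c[j] + ranks_c_prime[j]))
        for j in common
    )
    harmonic = sum(1 / k for k in range(1, len(common) + 1))
    return total / harmonic
```

Two models that retrieve the same chunks in the same order score 1 under both measures; identical sets in a different order still score 1 under Jaccard similarity but below 1 under rank similarity.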
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>The following paragraphs describe our choice of datasets and models, along with details of the implementation of our experiments.</p><p>As we focus on the retrieval component of RAG systems, we select five publicly available datasets from the BEIR benchmark <ref type="bibr" target="#b23">[24]</ref>. As generating embeddings for large datasets is a time-intensive process, especially for a larger number of models, we opt for five of the smaller datasets from the benchmark. This allows us to compare embeddings generated by a variety of models while also evaluating embedding similarity across datasets. An overview of the datasets is shown in Table <ref type="table">1</ref>. For each dataset, we create embeddings by splitting documents into text chunks such that each chunk contains 256 tokens. The embedding vectors are stored with Chroma DB <ref type="bibr" target="#b29">[30]</ref>, an open-source embedding database. For each vector, we additionally store the document and text chunk ids it encodes so that embeddings generated by different models can be matched for evaluation.</p><p>For model selection, we primarily use publicly available models from the MTEB leaderboard <ref type="bibr" target="#b2">[3]</ref>. We do not simply pick the best-performing models on the leaderboard; instead, our choices are influenced by several factors. Firstly, we focus on analyzing similarities within and across model families and pick models belonging to the e5 <ref type="bibr" target="#b30">[31]</ref>, t5 <ref type="bibr" target="#b31">[32,</ref><ref type="bibr" target="#b32">33]</ref>, bge <ref type="bibr" target="#b33">[34]</ref>, and gte <ref type="bibr" target="#b34">[35]</ref> families. Secondly, we recognize that users may wish to avoid the pay-by-token policies of proprietary models by identifying similar open-source alternatives. 
Therefore, we pick high-performing proprietary models, two from OpenAI (text-embedding-3-large and -small) <ref type="bibr" target="#b35">[36]</ref> and one from Cohere (Cohere embed-english-v3.0) <ref type="bibr" target="#b36">[37]</ref>. We also compare the mxbai-embed-large-v1 (mxbai) <ref type="bibr" target="#b37">[38]</ref> and UAE-Large-V1 (UAE) <ref type="bibr" target="#b38">[39]</ref> models, which not only report very similar performance on MTEB, but also identical embedding dimensions, model size, and memory usage. Finally, we include SFR-Embedding-Mistral (Mistral) <ref type="bibr" target="#b39">[40]</ref> as the best-performing model on the leaderboard at the time of our experiments. A detailed overview of all selected models can be seen in Table <ref type="table" target="#tab_1">2</ref>.</p><p>To compare embedding similarity across models and datasets, we employ different strategies depending on the similarity measure. We apply CKA by retrieving all embeddings created by a model, matching embeddings using their document and text chunk ids, and then computing their similarity for each of the five datasets. For Jaccard and rank similarity, we use sklearn's NearestNeighbors class <ref type="bibr" target="#b40">[41]</ref> to determine the top-𝑘 retrieval results. We compute Jaccard and rank scores per dataset, averaging over 25 queries. For the NFCorpus dataset, we calculate retrieval similarity for all possible 𝑘, i.e. using all embeddings generated for the dataset. As calculating similarity for each possible 𝑘 is computationally expensive, we did not repeat this for the remaining datasets and chose a smaller 𝑘 value instead. Furthermore, as only a limited number of results are provided as context to the generative model, analyzing retrieval similarity at low 𝑘 values, e.g. the top 10, is of most interest. 
As we are interested in identifying clusters of similar models, we also perform a hierarchical clustering on heatmap values using Seaborn <ref type="bibr" target="#b41">[42]</ref>. The following section describes the results of our evaluation for the different measures.</p></div>
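The per-query retrieval step described above can be sketched as follows, assuming the embeddings have been exported from the vector store into NumPy arrays; the helper `top_k_chunks` and its arguments are illustrative, not the paper's actual code. sklearn's NearestNeighbors with the cosine metric returns neighbours in order of increasing distance, i.e. decreasing similarity, which directly yields the ranks needed for the rank similarity measure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_k_chunks(query_emb, chunk_embs, chunk_ids, k=10):
    """Map chunk id -> retrieval rank (1 = most similar) for one query."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine")
    nn.fit(chunk_embs)
    # kneighbors returns neighbours sorted by increasing cosine distance.
    _, idx = nn.kneighbors(query_emb.reshape(1, -1))
    return {chunk_ids[i]: rank for rank, i in enumerate(idx[0], start=1)}
```

Running this for the same query against two models' embeddings gives the chunk-id sets (the dict keys) for Jaccard similarity and the rank maps for rank similarity.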
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>To evaluate how similar the embeddings generated by different models are, we first consider model families, checking whether their pair-wise and top-𝑘 similarity scores are highest within their family. Subsequently, we identify the open-source models which are most similar to our chosen proprietary models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Intra-and Inter-Family Clusters</head><p>Comparing embeddings directly with CKA shows high similarity across most of the models, albeit with some variance. These scores allow us to identify certain clusters of models.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the pair-wise CKA scores of all models averaged across the five datasets. As expected, scores for most models are highest within their own family. This holds true for the gtr-t5, sentence-t5 and text-embedding-3 (OpenAI) models. Although the sentence-t5 and gtr-t5 models are closely related, they do not exhibit significantly higher similarity with each other compared to the remaining models.</p><p>From an inter-family perspective, we observe high similarity between the bge and gte models. For some models in these two families, interestingly, the highest similarity scores correspond to inter-family counterparts with matching embedding dimensions rather than to models in the same family. Specifically, gte-small reports the highest similarity to bge-small and gte-base to bge-base. On the other hand, gte-large shows slightly higher similarity to bge-base than to bge-large, and thus to a model with a lower embedding dimension. Another inter-family cluster is formed by the three models with the highest CKA scores overall, namely UAE, mxbai and bge-large, whose scores suggest almost perfect embedding similarity. In fact, the similarity score of bge-large to these two models is much higher than to other bge models.</p><p>Shifting our attention to top-𝑘 retrieval similarity, clusters vary depending on the 𝑘 value. Figure <ref type="figure" target="#fig_4">3</ref> illustrates how Jaccard similarity evolves over 𝑘 on NFCorpus. The first plot displays Jaccard scores between bge-large and all other models, while the second plot illustrates the scores for gte-large. 
For extremely low 𝑘, we observe some peaks for nearly all models, followed by a noticeable drop in similarity. Naturally, for larger 𝑘, the scores converge to one. Reaffirming our earlier observations with the CKA metric, bge-large demonstrates high retrieval similarity with UAE and mxbai. Similarity to the remaining models is much lower, with the highest scores for bge-base and bge-small at larger 𝑘. However, especially for small 𝑘, there is high variance in the similarity scores, with models from other families, e.g. Mistral or gte-large, sometimes achieving higher scores than the bge models. A similar pattern can be observed in the second plot, where Jaccard similarity for gte-large is highest within its family for larger 𝑘, but models like mxbai or bge-base sometimes report higher similarity for small 𝑘. Therefore, the clusters we identified through our CKA analysis are only truly reflected in these plots for large values of 𝑘. This suggests that in real-world use cases, where the top-𝑘 results are crucial, such representational similarity measures might not provide the full picture. The plots for other model families provide nearly identical insights to those in the second plot in Figure <ref type="figure" target="#fig_4">3</ref> and thus we do not present them for the sake of brevity.</p><p>For rank similarity, scores peak for small 𝑘 and then quickly start to drop until they approach a low stable score for larger 𝑘, as shown in Figure <ref type="figure" target="#fig_1">2</ref> for gte-large. Once again, the bge/UAE/mxbai inter-family cluster shows the highest similarity. In contrast to Jaccard similarity, the clusters that could be observed for CKA do not always show for rank similarity. 
As can be seen in Figure <ref type="figure" target="#fig_1">2</ref>, the model with the highest rank similarity to gte-large is mxbai, rather than another gte model. Even so, the previously observed clusters also tend to appear for rank similarity, though they vary more depending on the models and dataset. Generally, scores for nearly all models are rather small for larger 𝑘, indicating low rank similarity. For small 𝑘, results vary more and differences between individual models are more pronounced.</p><p>As retrieval similarity at small 𝑘 is of most interest from a practical perspective, we take a closer look at top-10 Jaccard similarity. The heatmaps in Figures <ref type="figure" target="#fig_11">4-6</ref> show the top-10 Jaccard similarity between models across datasets. A striking insight here is that even the most similar models only reach a Jaccard similarity slightly above 0.6, with the majority of scores below 0.5. This is of great relevance to practitioners, as it implies that even embeddings from models with high representational similarity scores may yield little overlap in retrieved text chunks. As earlier, the cluster of UAE/mxbai/bge-large is the most prominent one with the highest scores. Intra-family scores tend to be the highest for most models, i.e. t5 and OpenAI. 
Depending on the dataset, this also applies to gte and e5 models, although Jaccard similarity to models from other families is sometimes higher. We also note that for the two larger datasets FiQA-2018 and TREC-COVID, the similarity scores are generally substantially lower, as can be seen in Figure <ref type="figure" target="#fig_11">6</ref>. For the smaller datasets, Jaccard similarity is generally higher, though results differ depending on the data (see Figures <ref type="figure" target="#fig_10">4 and 5</ref>). Similar observations can be made for rank similarity, although the appearance of family clusters is more dependent on the dataset. Larger datasets also lead to lower scores. These results illustrate that while family clusters can still be perceived at small 𝑘, they are not as prominent as they are for larger 𝑘. Furthermore, the top-10 retrieved results differ substantially for most models and datasets, and their similarity might depend on the dataset itself.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Open Source Alternatives to Proprietary Models</head><p>We explicitly included proprietary models in our analysis to check which open-source models are the best, i.e. the most similar, alternatives. The CKA scores in Figure <ref type="figure" target="#fig_0">1</ref> indicate that embeddings generated by OpenAI's models (text-embedding-3-large/-small) are highly similar to those generated by Mistral, while the Cohere model (embed-english-v3.0) demonstrates high similarity to e5-large-v2. These observations do not entirely extend to retrieval similarity, especially for Cohere. While Mistral is still the most similar model to OpenAI's for larger 𝑘 across all datasets, there is no consistently most similar model for Cohere. Rather, the most similar model varies depending on the dataset and measure (Jaccard or rank similarity), sometimes being e5-large-v2, but sometimes also other models like Mistral. Taking a closer look at top-10 similarity, Mistral still largely exhibits the highest similarity to the OpenAI models, especially to text-embedding-3-large. For text-embedding-3-small, scores on all datasets are rather close to those of other models. In absolute terms, however, retrieval similarity between Mistral and the OpenAI models is only low to moderate. On smaller datasets, the highest Jaccard similarity to text-embedding-3-large only reaches about 0.6 (see Figure <ref type="figure" target="#fig_6">5</ref>), while on TREC-COVID, the largest dataset, Jaccard similarity goes down to merely 0.18 (see Figure <ref type="figure" target="#fig_11">6</ref>). For Cohere's model, the most similar model in terms of top-10 Jaccard similarity differs for each dataset, with the highest score of 0.51 occurring on ArguAna, as shown in Figure <ref type="figure" target="#fig_6">5</ref>. 
For all proprietary models, even the best retrieval similarity at top-10 still suggests that the text chunks that would be presented to an LLM as context can differ notably. Once again, we could also observe dataset-dependent variance in scores, with lower retrieval similarity on larger datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>While a pair-wise comparison of embeddings using CKA shows intra-and inter-family model clusters, retrieval similarity over different 𝑘 offers a more nuanced picture. Especially for small 𝑘, which is of most interest from a practical perspective, retrieval similarity varies. When comparing the top-10 retrieved text chunks, the low Jaccard similarity scores indicate little overlap in retrieved chunks, even when CKA scores are high. Especially for the two larger datasets FiQA-2018 and TREC-COVID, these scores are extremely low. As RAG systems usually operate on millions of embeddings, our datasets are comparatively small. Therefore, should there be a general trend of larger datasets leading to lower retrieval similarity, the text chunks retrieved by different models in a regular use case might be nearly disjoint for small 𝑘. Overall, our results suggest that even though embeddings seem rather similar when compared directly, retrieval behaviour can still vary substantially, is most unstable at the 𝑘 values commonly used in RAG systems, and is also dataset-dependent. Retrieved chunks at small 𝑘 show the least overlap, leading to large differences in the data that would be presented to an LLM as additional context. 
Our analysis demonstrates that although models tend to be most similar to models from their own family, inter-family clusters exist. The most prominent of these clusters is formed by the models bge-large-en-v1.5, UAE-Large-V1 and mxbai-embed-large-v1, which demonstrate high similarity even for retrieval at low 𝑘. Nevertheless, the high variance of retrieval similarity within the remaining clusters for small 𝑘 means that while the identified clusters might provide some orientation when choosing an embedding model, the choice remains a non-trivial task. Identifying suitable alternatives to proprietary models is likewise non-trivial. 
While we were able to identify SFR-Embedding-Mistral as the model most similar to OpenAI's embedding models, Jaccard similarity at top-10 on larger datasets shows a low overlap in retrieved text chunks. Furthermore, for Cohere's embedding model, we were unable to find a single most similar model, as the most similar model varied across datasets for small 𝑘 values.</p></div>
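The linear variant of CKA used for the pair-wise embedding comparisons discussed above can be sketched as follows (a standard formulation of linear CKA; the function and variable names are ours, not the paper's code):

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two embedding matrices of shape (n_texts, dim).

    The two models may use different embedding dimensions; only the
    number of embedded texts n must match. The score lies in [0, 1].
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the embeddings, which is why two models can score near 1 under CKA while still ranking chunks differently at retrieval time.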
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this paper we evaluated the similarity of embedding models on different datasets. Given the large number of available models, identifying clusters or families of models with similar embeddings can simplify the model selection process. While previous work on LLM similarity exists, to the best of the authors' knowledge, it so far lacks a clear focus on embedding models specifically in the context of RAG. We therefore analyzed the similarity of embeddings generated by 19 different models using CKA for pairwise comparison as well as Jaccard and rank similarity to compare retrieval behavior at top-𝑘 across five datasets. Comparing embeddings with CKA generally showed intra-and inter-family clusters across datasets. These clusters also appeared when evaluating top-𝑘 retrieval similarity with large 𝑘 values. However, scores for low 𝑘 values, which would commonly be chosen in RAG systems, show high variance and much lower similarity, especially on larger datasets. Although we were able to identify some model clusters, our results suggest that choosing the optimal model remains a non-trivial task that requires careful consideration.</p><p>Still, we argue that a better understanding of how similarly different embedding models behave is an important research topic that requires further attention. There are a plethora of real-world scenarios where RAG systems can potentially be deployed. One such scenario, for example, is to retrieve relevant web content in response to a query. As large corpora of such data are available in the form of Web ARChive (WARC) files, evaluating embedding model similarity on such large, uncleaned datasets would perhaps lead to a better estimation of model similarity for a realistic RAG use case. 
Additionally, as documents often need to be split into smaller chunks to fit the models' input limits, the effect of chunking strategies such as token-based or semantic chunking on embedding similarity could be explored. Furthermore, our evaluation focused on a small sample of similarity measures, each with its own definition of what makes models similar. Introducing further measures with different perspectives could improve our understanding of which factors influence model similarity. Finally, our focus was on identifying clusters or families of models, which for our initial experiments led us to choose families of embedding models with varying performance on MTEB. Given the frequent appearance of new models on the leaderboard and the focus on strong MTEB performance, it would be of interest to compare the best-performing models on MTEB and examine whether their relative difference in performance correlates with how similar these models are.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Mean CKA similarity across all five datasets. Models tend to be most similar to models belonging to their own family, though some interesting inter-family patterns are visible as well.</figDesc></figure>
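The token-based chunking mentioned as future work above can be illustrated with a minimal sliding-window sketch (the parameter values and function name are illustrative assumptions, not the paper's setup):

```python
def chunk_by_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into fixed-size overlapping chunks,
    so that no chunk exceeds an embedding model's input limit."""
    step = chunk_size - overlap
    # max(..., 1) guarantees at least one chunk for short documents
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]


# A 1000-token document becomes chunks of 512, 512, and 104 tokens.
chunks = chunk_by_tokens(list(range(1000)))
print([len(c) for c in chunks])
```

The overlap keeps sentences that straddle a chunk boundary fully contained in at least one chunk; semantic chunking would instead place boundaries at topically coherent break points.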
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Rank similarity over all 𝑘 on NFCorpus, comparing gte-large to all other models. Scores are highest and vary most for small 𝑘, but then drop quickly before stabilizing for larger 𝑘.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Legend for Figure 3(b): gte-large compared to each of the other models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Jaccard similarity over all 𝑘 on NFCorpus, comparing bge-large (a) and gte-large (b) to all other models. While bge-large shows high similarity to UAE-Large-V1 and mxbai-embed-large-v1, scores for gte-large are clustered much closer. Jaccard similarity seems to be most unstable for small values of 𝑘, which would commonly be chosen for retrieval tasks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4:</head><label>4</label><figDesc>Figure 4: Jaccard (a) and rank similarity (b) for the top-10 retrieved text chunks averaged over 25 queries on NFCorpus. The clusters vary slightly depending on the measure, as do the scores. Models tend to be most similar to models from their own family. However, some inter-family clusters are visible as well.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>5</head><label>5</label><figDesc>Axis labels: the evaluated embedding models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 5:</head><label>5</label><figDesc>Figure 5: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on SciFact (a) and ArguAna (b). The UAE and mxbai models show high levels of similarity along with bge-large. The remaining models tend to show the highest similarity within their own family, with the exception of the bge/gte inter-family cluster.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>5</head><label>5</label><figDesc>Axis labels: the evaluated embedding models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>5</head><label>5</label><figDesc>Axis labels: the evaluated embedding models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 6:</head><label>6</label><figDesc>Figure 6: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on FiQA-2018 (a) and TREC-COVID (b). Most models seem to retrieve almost completely distinct text chunks. Only the bge/UAE/mxbai cluster still shows a notable level of similarity, while the scores of the remaining clusters indicate only moderate to low levels of similarity.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>We compare a diverse set of open source models from different families as well as proprietary models with varying performance on MTEB.</figDesc><table><row><cell>Model</cell><cell>Embedding dimension</cell><cell>Max. tokens</cell><cell>MTEB average</cell><cell>Open source</cell></row><row><cell>SFR-Embedding-Mistral</cell><cell>4096</cell><cell>32768</cell><cell>67.56</cell><cell>✓</cell></row><row><cell>mxbai-embed-large-v1</cell><cell>1024</cell><cell>512</cell><cell>64.68</cell><cell>✓</cell></row><row><cell>UAE-Large-V1</cell><cell>1024</cell><cell>512</cell><cell>64.64</cell><cell>✓</cell></row><row><cell>text-embedding-3-large</cell><cell>3072</cell><cell>8191</cell><cell>64.59</cell><cell>✗</cell></row><row><cell>Cohere embed-english-v3.0</cell><cell>1024</cell><cell>512</cell><cell>64.47</cell><cell>✗</cell></row><row><cell>bge-large-en-v1.5</cell><cell>1024</cell><cell>512</cell><cell>64.23</cell><cell>✓</cell></row><row><cell>bge-base-en-v1.5</cell><cell>768</cell><cell>512</cell><cell>63.55</cell><cell>✓</cell></row><row><cell>gte-large</cell><cell>1024</cell><cell>512</cell><cell>63.13</cell><cell>✓</cell></row><row><cell>gte-base</cell><cell>768</cell><cell>512</cell><cell>62.39</cell><cell>✓</cell></row><row><cell>text-embedding-3-small</cell><cell>1536</cell><cell>8191</cell><cell>62.26</cell><cell>✗</cell></row><row><cell>e5-large-v2</cell><cell>1024</cell><cell>512</cell><cell>62.25</cell><cell>✓</cell></row><row><cell>bge-small-en-v1.5</cell><cell>384</cell><cell>512</cell><cell>62.17</cell><cell>✓</cell></row><row><cell>e5-base-v2</cell><cell>768</cell><cell>512</cell><cell>61.5</cell><cell>✓</cell></row><row><cell>gte-small</cell><cell>384</cell><cell>512</cell><cell>61.36</cell><cell>✓</cell></row><row><cell>e5-small-v2</cell><cell>384</cell><cell>512</cell><cell>59.93</cell><cell>✓</cell></row><row><cell>gtr-t5-large</cell><cell>768</cell><cell>512</cell><cell>58.28</cell><cell>✓</cell></row><row><cell>sentence-t5-large</cell><cell>768</cell><cell>512</cell><cell>57.06</cell><cell>✓</cell></row><row><cell>gtr-t5-base</cell><cell>768</cell><cell>512</cell><cell>56.19</cell><cell>✓</cell></row><row><cell>sentence-t5-base</cell><cell>768</cell><cell>512</cell><cell>55.27</cell><cell>✓</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="38" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Is your llm outdated? benchmarking llms &amp; alignment algorithms for time-sensitive knowledge</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Mousavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Alghisi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Riccardi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.08700</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Magne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.07316</idno>
		<title level="m">Mteb: Massive text embedding benchmark</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Klabunde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schumacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Strohmaier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lemmerich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06329</idno>
		<title level="m">Similarity of neural network models: A survey of functional and representational measures</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Klabunde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Amor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Granitzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lemmerich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.02730</idno>
		<title level="m">Towards measuring representational similarity of large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability</title>
		<author>
			<persName><forename type="first">M</forename><surname>Raghu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gilmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sohl-Dickstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Insights on representational similarity in neural networks with canonical correlation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Morcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Raghu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Canonical correlation analysis: An overview with application to learning methods</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Hardoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Szedmak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="2639" to="2664" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Grounding representation similarity through statistical testing</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-S</forename><surname>Denain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="1556" to="1568" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">On the similarity between hidden layers of pruned and unpruned convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zullich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pellegrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Medvet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ansuini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods</title>
				<meeting>the 9th International Conference on Pattern Recognition Applications and Methods</meeting>
		<imprint>
			<publisher>Scitepress</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="52" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Inner product-based neural network similarity</title>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Miao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Qiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Convergent learning: Do different neural networks learn the same representations?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lipson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hopcroft</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.07543</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Convergent learning: Do different neural networks learn the same representations?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lipson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hopcroft</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v44/li15convergent.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Storcheus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rostamizadeh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</editor>
		<meeting>the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015<address><addrLine>PMLR, Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="196" to="212" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Similarity of neural network representations revisited</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v97/kornblith19a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="3519" to="3529" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Revisiting model stitching to compare neural representations</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakkiran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Barak</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2021/file/01ded4259d101feb739b06c399e9cd9c-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="225" to="236" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Understanding image representations by measuring their equivariance and equivalence</title>
		<author>
			<persName><forename type="first">K</forename><surname>Lenc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="991" to="999" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">On the functional similarity of robust and non-robust neural representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Balogh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jelasity</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/balogh23a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>the 40th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="1614" to="1635" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Launch and iterate: Reducing prediction churn</title>
		<author>
			<persName><forename type="first">M</forename><surname>Milani Fard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cormier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Canini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gupta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">DiffChaser: Detecting disagreements for deep neural networks</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19</title>
		<imprint>
			<publisher>International Joint Conferences on Artificial Intelligence Organization</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">ModelDiff: testing-based DNN similarity comparison for model reuse detection</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3460319.3464816</idno>
		<ptr target="http://dx.doi.org/10.1145/3460319.3464816" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA &apos;21</title>
				<meeting>the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA &apos;21</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.01172</idno>
		<title level="m">Similarity analysis of contextual word representation models</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Word embeddings revisited: Do LLMs offer something new?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Freestone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K K</forename><surname>Santu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.11094</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Godfrey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Konz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kvinge</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.14993</idno>
		<title level="m">Understanding the inner workings of language models through representation dissimilarity</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rücklé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08663</idno>
		<title level="m">BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Finardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Avila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Castaldoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gengo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Larcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Piau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Costa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Caridá</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.07883</idno>
		<title level="m">The chronicles of RAG: The retriever, the chunk and the generator</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Blended RAG: Improving RAG (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sawarkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mangal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Solanki</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.07220</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Es</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>James</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espinosa-Anke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schockaert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.15217</idno>
		<title level="m">RAGAS: Automated evaluation of retrieval augmented generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Measuring statistical dependence with Hilbert-Schmidt norms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gretton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Smola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Algorithmic Learning Theory</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">U</forename><surname>Simon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Tomita</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="63" to="77" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Towards understanding the instability of network embedding</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Guan</surname></persName>
		</author>
		<idno type="DOI">10.1109/TKDE.2020.2989512</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="927" to="941" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><surname>Chroma Inc.</surname></persName>
		</author>
		<ptr target="https://docs.trychroma.com/" />
		<title level="m">Chroma Homepage</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.03533</idno>
		<title level="m">Text embeddings by weakly-supervised contrastive pre-training</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">H</forename><surname>Ábrego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.08877</idno>
		<title level="m">Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Large dual encoders are generalizable retrievers</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">H</forename><surname>Ábrego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.07899</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.07597</idno>
		<title level="m">C-Pack: Packaged resources to advance general Chinese embedding</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.03281</idno>
		<title level="m">Towards general text embeddings with multi-stage contrastive learning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://openai.com/blog/new-embedding-models-and-api-updates" />
		<title level="m">New embedding models with lower pricing, OpenAI Blog</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><surname>Cohere</surname></persName>
		</author>
		<ptr target="https://cohere.com/embeddings" />
		<title level="m">Embeddings - text embeddings with advanced language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Cohere Homepage</note>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shakir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Koenig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lipp</surname></persName>
		</author>
		<ptr target="https://www.mixedbread.ai/blog/mxbai-embed-large-v1" />
		<title level="m">Open source strikes bread - new fluffy embeddings model</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.12871</idno>
		<title level="m">Angle-optimized text embeddings</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yavuz</surname></persName>
		</author>
		<ptr target="https://blog.salesforceairesearch.com/sfr-embedded-mistral/" />
		<title level="m">SFR-Embedding-Mistral: Enhance text retrieval with transfer learning, Salesforce AI Research Blog</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">seaborn: statistical data visualization</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Waskom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Open Source Software</title>
		<idno type="DOI">10.21105/joss.03021</idno>
		<ptr target="https://doi.org/10.21105/joss.03021" />
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
