<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Caspari</string-name>
          <email>laura.caspari@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kanishka Ghosh Dastidar</string-name>
          <email>kanishka.ghoshdastidar@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saber Zerhoudi</string-name>
          <email>saber.zerhoudi@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jelena Mitrovic</string-name>
          <email>jelena.mitrovic@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <email>michael.granitzer@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: we use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular Benchmarking Information Retrieval (BEIR) benchmark. Through our experiments we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-k retrieval similarity reveals high variance at low k values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to the OpenAI models.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language model</kwd>
        <kwd>Retrieval-augmented generation</kwd>
        <kwd>Model similarity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>
        Retrieval-Augmented Generation (RAG) is an emerging
paradigm that helps mitigate the problems of factual
hallucination [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and outdated training data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of large language
models (LLMs) by providing these models with access to
an external, non-parametric knowledge source (e.g. a
document corpus). Central to the functioning of RAG
frameworks is the retrieval step, wherein a small subset of
candidate documents is retrieved from the document corpus,
specific to the input query or prompt. This retrieval
process, known as dense retrieval, hinges on text embeddings.
Typically, the generation of these embeddings is assigned
to an LLM, for which there are several options due to the
rapid evolution of the field. Consequently, selecting the
most suitable embedding model from an array of available
choices emerges as a critical aspect in the development of
RAG systems. The information to guide this choice is
currently primarily limited to architectural details (which are
also on occasion scarce due to the prevalence of closed
models) and performance benchmarks such as the Massive Text
Embedding Benchmark (MTEB) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        We posit that an analysis of the similarity of the
embeddings generated by these models would significantly aid
this model selection process. Given the large number of
candidates and the ever-increasing scale of the models, a
from-scratch empirical evaluation of the embedding quality of
these LLMs on a particular task can incur significant costs.
This challenge becomes especially pronounced when
dealing with large-scale corpora comprising potentially millions
of documents. While the relative performance scores of
these models on benchmark datasets offer the simplified
perspective of comparing a single scalar value on an
array of downstream tasks, such a view of model similarity
might overlook the nuances of the relative behaviour of
the models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As an example, the absolute difference in
precision@k between two retrieval systems only provides a
weak indication of the overlap of retrieved results. We argue
that identifying clusters of models with similar behaviour
would allow practitioners to construct smaller, yet diverse
candidate pools of models to evaluate. Beyond model
selection, as highlighted by Klabunde et al., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], such an analysis
also facilitates the identification of common factors
contributing to strong performance, easier model ensembling,
and detection of potential instances of unauthorized model
reuse.
      </p>
      <p>In this paper, we analyze different LLMs in terms of the
similarities of the embeddings they generate. Our similarity
analysis serves as an unsupervised evaluation framework
for these embedding models, in contrast to performance
benchmarks that require labelled data. We do this from a
dual perspective: we directly compare the embeddings
using representational similarity measures. Additionally, we
evaluate model similarity specifically in terms of their
functional impact on RAG systems, i.e. we look at how similar
the retrieved results are. Our evaluation focuses on
several prominent model families, to analyze similarities both
within and across them. We also compare proprietary
models (such as those by OpenAI or Cohere) to open-sourced
ones in order to identify the most similar alternatives. Our
experiments are carried out on five popular benchmark
datasets to determine if similarities between models are
influenced by the choice of data. Our code is available at
https://github.com/casparil/embedding-model-similarity.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Studies evaluating similarities of neural networks fall into
two main categories: the first involves comparing
activations of different models generated at any pair of layers for a
specific input (representational similarity), while the second
compares the model outputs (functional similarity). Raghu
et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Morcos et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] propose measures building
on Canonical Correlation Analysis (CCA) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a statistical
technique used to find the linear relationship between two
sets of variables by maximizing their correlation. Such
comparisons using CCA or variants thereof can be found in
several works [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Beyond CCA-based measures,
other works have also explored computing correlations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
and the mutual information [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] between neurons across
networks. Kornblith et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] propose Centered Kernel
Alignment (CKA), which they show improves over several
similarity measures in identifying corresponding layers of
identical networks with different initializations. A diverse
range of functional similarity evaluations have also been
explored in the literature. A few examples include
model stitching [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], disagreement measures between
output classes [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], and quantifying the similarity
between the class-wise output probabilities [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. We refer
the reader to the survey by Klabunde et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for a
detailed overview of representational and functional similarity
measures.
      </p>
      <p>
        Recently, a few works have also focused on specifically
evaluating the similarity of LLMs. While Wu et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
evaluate language models along several perspectives, such
as their representational and neuron-level similarities, their
evaluation pre-dates the introduction of the recent wave
of large scale models. Freestone and Santu [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] consider
similarities of word embeddings, and evaluate if LLMs
differ significantly from classical encoding models in terms of
their representations. The works by Klabunde et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and Brown et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] are more recent, and evaluate the
representational similarity of LLMs, with the latter also
considering the similarities between models of different sizes
in the same model family.
      </p>
      <p>
        Much of the literature on evaluation of LLM embeddings
focuses on their performance on downstream tasks, with
benchmarks such as BEIR [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] (for retrieval specifically) and
MTEB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] providing a unified view of embedding quality
across metrics and datasets. The metrics used here mostly
include typical information retrieval metrics such as
precision, recall, and mean reciprocal rank at certain cutoffs.
Some works specifically evaluate the retrieval components
in a RAG context, where they either use a dataset outside
of those included in the benchmarks [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] or where the
evaluation encompasses other aspects of the retriever beyond
the embedding model being used [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Another approach,
that does not rely on ground-truth labels, is given by the
Retrieval Augmented Generation Assessment (RAGAS)
framework, which uses an LLM to determine the ratio of sentences
in the retrieved context that are relevant to the answer
being generated [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. To the best of our knowledge, there are
no works that evaluate the similarity of embedding models
from a retrieval perspective.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We evaluate embedding model similarity using two
approaches. The first directly compares the embeddings of text
chunks generated by the models. The second approach is
specific to the RAG context, where we evaluate the
similarity of retrieved results for a given query. These approaches
are discussed in detail in the following sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Pair-wise Embedding Similarity</title>
        <p>
          There are several metrics defined in the literature that
measure representational similarity [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Many of these metrics
require the representation spaces of the embeddings being
compared to be aligned and/or the dimensionality of the
embeddings to be identical across models. To avoid these
constraints, we pick Centered Kernel Alignment (CKA) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
with a linear kernel as our similarity measure.
        </p>
        <p>
          The measure computes similarity between two sets of
embeddings in two steps. First, for a set of embeddings,
the pair-wise similarity scores between all entries within
this set are computed using the kernel function. Thus, row
k of the resulting similarity matrix contains entries
representing the similarity between embedding k and all other
embeddings, including itself. Computing two such
embedding similarity matrices for different models with the same
number of embeddings then leads to two matrices E and
E' of matching dimensions. These are compared directly
in the second step with the Hilbert-Schmidt Independence
Criterion (HSIC) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] using the following formula:
$$\mathrm{CKA}(E, E') = \frac{\mathrm{HSIC}(E, E')}{\sqrt{\mathrm{HSIC}(E, E)\,\mathrm{HSIC}(E', E')}}$$
        </p>
        <p>
          The resulting similarity scores are bounded in the interval
[0, 1], with a score of 1 indicating equivalent representations.
CKA assumes that representations are mean-centered.
        </p>
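        <p>
          To make this computation concrete, the following sketch (our own illustration, not taken from the paper's released code) computes linear CKA for two matched embedding matrices with one row per text chunk; the embedding dimensions of the two models may differ:
        </p>
        <preformat>
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between embedding matrices of shape (n_chunks, dim).

    Only the number of rows (matched text chunks) must agree; the
    embedding dimensions of the two models may differ.
    """
    # Mean-center each representation, as CKA assumes.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # With a linear kernel, HSIC reduces to squared Frobenius norms
    # of the (cross-)covariance matrices, so the Gram matrices never
    # have to be materialized.
    hsic_xy = np.linalg.norm(y.T @ x, ord="fro") ** 2
    hsic_xx = np.linalg.norm(x.T @ x, ord="fro") ** 2
    hsic_yy = np.linalg.norm(y.T @ y, ord="fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
        </preformat>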
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval Similarity</title>
        <p>While a pair-wise comparison of embeddings offers insights
into the similarities of the representations learned by these
models, it does not suffice to quantify the similarities in
outcomes when these embedding models are deployed for
specific tasks. Therefore, in the context of RAG systems, we
consider the similarity of retrieved text chunks for a given
query when different embedding models are used. As a
first step, for a given dataset, we generate embeddings of
queries and document chunks with each of the embedding
models. We then retrieve the k most similar embeddings
in terms of cosine similarity for a particular query. As
these embeddings correspond to specific chunks of text, we
derive the sets of retrieved chunks C and C' for a pair of
models. To measure the similarity of these sets, we use the
Jaccard similarity coefficient as follows:
$$J(C, C') = \frac{|C \cap C'|}{|C \cup C'|}$$</p>
        <p>
          Here, |C ∩ C'| corresponds to the overlap in text chunks,
counting how often the two models retrieved the same
chunks. Similarly, the union |C ∪ C'| corresponds to all
retrieved text chunks, counting chunks present in both sets
only once. The resulting score is bounded in the interval
[0, 1], with 1 indicating that both models retrieved the same
set of text chunks.
        </p>
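        <p>
          A minimal sketch of this computation over sets of retrieved chunk ids (our own illustration; the released code may implement it differently):
        </p>
        <preformat>
def jaccard_similarity(c: set, c_prime: set) -> float:
    """Jaccard similarity between two sets of retrieved chunk ids."""
    if not c and not c_prime:
        return 1.0
    # Overlap divided by the union; chunks in both sets count once.
    return len(c.intersection(c_prime)) / len(c.union(c_prime))

# Example with top-3 results of two models, identified by chunk id:
# jaccard_similarity({"d1-c0", "d1-c3", "d2-c1"},
#                    {"d1-c0", "d2-c1", "d4-c2"})  -> 0.5
        </preformat>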
        <p>
          While Jaccard similarity measures the degree to
which two sets overlap, it ignores the order of their elements. Rank
similarity [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], on the other hand, considers the order of
common elements, with elements ranked closer to the top having a higher
impact on the score. The measure assigns ranks to common
text chunks according to their similarity to the query, i.e.
r(c) = k if chunk c was the k-th retrieved result for the
query. Ranks are then compared using:
$$s(r(c), r'(c)) = \frac{2}{(1 + |r(c) - r'(c)|)\,(r(c) + r'(c))}$$
        </p>
        <p>With this, rank similarity for two sets of retrieved text
chunks C, C' is calculated as:</p>
        <p>
          $$\mathrm{RankSim}(C, C') = \frac{1}{H(|C \cap C'|)} \sum_{c \in C \cap C'} s(r(c), r'(c))$$
with $H(K) = \sum_{k=1}^{K} \frac{1}{k}$ denoting the K-th
harmonic number (here K = |C ∩ C'|), normalizing the score. Like the other
measures, rank similarity is bounded in the interval [0, 1],
with 1 indicating that all ranks are identical.
        </p>
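        <p>
          A sketch of the full computation (our own illustration), assuming each model's retrieval is given as a mapping from chunk id to its 1-based rank:
        </p>
        <preformat>
def rank_similarity(ranks: dict, ranks_prime: dict) -> float:
    """Rank similarity between two top-k retrievals.

    `ranks` and `ranks_prime` map chunk ids to 1-based retrieval ranks.
    """
    common = set(ranks).intersection(ranks_prime)
    if not common:
        return 0.0
    # Per-chunk score: matches that are highly ranked and close in
    # rank contribute the most.
    total = sum(
        2.0 / ((1 + abs(ranks[c] - ranks_prime[c])) * (ranks[c] + ranks_prime[c]))
        for c in common
    )
    # Normalize by the K-th harmonic number (K = overlap size), so
    # two identical rankings score exactly 1.
    harmonic = sum(1.0 / k for k in range(1, len(common) + 1))
    return total / harmonic
        </preformat>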
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>The following paragraphs describe our choice of datasets
and models, along with details of the implementation of our
experiments.</p>
      <p>
        As we focus on the retrieval component of RAG
systems, we select five publicly available datasets from the
BEIR benchmark [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. As generating embeddings for large
datasets is a time-intensive process, especially for a larger
number of models, we opt for five of the smaller datasets
from the benchmark. This approach allows us to compare
embeddings generated by a variety of models while at the
same time allowing us to evaluate embedding similarity
across datasets. An overview of the datasets is shown in Table
1. For each dataset, we create embeddings by splitting
documents into text chunks such that each chunk contains 256
tokens. The embedding vectors are stored with Chroma DB
[
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], an open source embedding database. For each vector,
we additionally store information about the document and
text chunk ids it encodes to be able to match embeddings
generated by different models for evaluation.
      </p>
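      <p>
        As an illustration of this setup, the following sketch stores chunk embeddings in Chroma DB together with the document and chunk ids used for matching. The model and collection names are placeholders, and the chunk list is simplified relative to the 256-token splitting described above:
      </p>
      <preformat>
import chromadb
from sentence_transformers import SentenceTransformer

# Placeholder model; the study compares 19 embedding models.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Assume `chunks` holds 256-token text chunks, with parallel lists
# of document and chunk ids produced during splitting.
chunks = ["first chunk of document 0 ...", "second chunk ..."]
doc_ids, chunk_ids = [0, 0], [0, 1]

client = chromadb.Client()
collection = client.create_collection(name="bge-large-nfcorpus")
collection.add(
    ids=[f"{d}-{c}" for d, c in zip(doc_ids, chunk_ids)],
    embeddings=model.encode(chunks).tolist(),
    # Document and chunk ids let us match embeddings of the same
    # text across different models later.
    metadatas=[{"doc_id": d, "chunk_id": c}
               for d, c in zip(doc_ids, chunk_ids)],
)
      </preformat>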
      <p>
        For model selection, we primarily use publicly available
models from the MTEB leaderboard [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We do not simply
pick the best performing models on the leaderboard; instead,
our choices are influenced by several factors. Firstly, we
focus on analyzing similarities within and across model
families and pick models belonging to the e5 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], t5 [
        <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
        ],
bge [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and gte [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] families. Secondly, we recognize
that it might be of interest to users to avoid pay-by-token
policies of proprietary models by identifying similar
open-source alternatives. Therefore, we pick high-performing
proprietary models, two from OpenAI
(text-embedding-3-large and -small) [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] and one from Cohere (Cohere
embed-english-v3.0) [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. We also compare the
mxbai-embed-large-v1 (mxbai) [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] and UAE-Large-V1 (UAE) [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] models, which
not only report very similar performance on MTEB, but also
identical embedding dimensions, model size, and memory
usage. Finally, we include SFR-Embedding-Mistral (Mistral)
[
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] as the best-performing model on the leaderboard at the
time of our experiments. A detailed overview of all selected
models can be seen in Table 2.
      </p>
      <p>
        To compare embedding similarity across models and
datasets, we employ different strategies depending on the
similarity measure. We apply CKA by retrieving all
embeddings created by a model, matching embeddings using
their document and text chunk ids and then computing
their similarity for each of the five datasets. For Jaccard
and rank similarity, we use sklearn’s NearestNeighbor class
[
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] to determine the top-k retrieval results. We
compute Jaccard and rank scores per dataset, averaging over 25
queries. For the NFCorpus dataset, we calculate retrieval
similarity for all possible k, i.e. using all embeddings
generated for the dataset. As calculating similarity for each
possible k is computationally expensive, we did not repeat
this for the remaining datasets and chose a smaller k value
instead. Furthermore, as only a limited number of results
are to be provided as context to the generative model,
analyzing retrieval similarity at low k values, e.g. top-10, is
of most interest. As we are interested in identifying clusters
of similar models, we also perform a hierarchical clustering
on heatmap values using Seaborn [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]. The following
section describes the results of our evaluation for the different
measures.
      </p>
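      <p>
        A sketch of the retrieval step (our own illustration): for each model, the top-k chunk indices per query are obtained with sklearn's NearestNeighbors under cosine distance, and the resulting id sets feed the Jaccard and rank computations of Section 3:
      </p>
      <preformat>
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_k_indices(doc_embs: np.ndarray, query_embs: np.ndarray,
                  k: int) -> np.ndarray:
    """Indices of the k most cosine-similar chunks for each query."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(doc_embs)
    _, indices = nn.kneighbors(query_embs)
    return indices  # shape: (n_queries, k)

# Average top-10 Jaccard similarity between two models over queries,
# assuming both embedding matrices index chunks in the same order.
# jaccard_similarity is the sketch from Section 3.2.
def mean_jaccard(embs_a, embs_b, queries_a, queries_b, k=10) -> float:
    top_a = top_k_indices(embs_a, queries_a, k)
    top_b = top_k_indices(embs_b, queries_b, k)
    return float(np.mean([jaccard_similarity(set(a), set(b))
                          for a, b in zip(top_a, top_b)]))
      </preformat>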
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>To evaluate how similar embeddings generated by different
models are, we will first consider model families, checking if
their pairwise and top-k similarity scores are highest within
their family. Subsequently, we will identify the open source
models which are most similar to our chosen proprietary
models.</p>
      <sec id="sec-5-1">
        <title>5.1. Intra- and Inter-Family Clusters</title>
        <p>Comparing embeddings directly with CKA shows high
similarity across most of the models, albeit with some variance.
These scores allow us to identify certain clusters of models.
Figure 1 shows the pair-wise CKA scores of all models
averaged across the five datasets. As expected, scores for most
models are highest within their own family. This holds true
for the gtr-t5, sentence-t5 and text-embedding-3 (OpenAI)
models. Although the sentence-t5 and gtr-t5 models are
closely related, they do not exhibit significantly higher
similarity with each other compared to the remaining models.</p>
        <p>From an inter-family perspective, we observe high
similarity between the bge and gte models. For some models
in these two families, interestingly, the highest similarity
scores correspond to inter-family counterparts with
matching embedding dimensions rather than to models in the
same family. Specifically, gte-small reports the highest
similarity to bge-small and gte-base to bge-base. On the other
hand, gte-large shows slightly higher similarity to bge-base
than to bge-large, and thus to a model with a lower embedding
dimension. Another inter-family cluster is formed by the
three models with the highest CKA scores overall, namely
UAE, mxbai and bge-large, whose scores suggest almost
perfect embedding similarity. In fact, the similarity score of
bge-large to these two models is much higher than to other
bge models.</p>
        <p>Shifting our attention to top-k retrieval similarity, clusters
vary depending on the k value. Figure 3 illustrates how
Jaccard similarity evolves over k on NFCorpus. The first
plot displays Jaccard scores between bge-large and all other
models, while the second plot illustrates the scores for
gte-large. For extremely low k, we observe some peaks for
nearly all models, followed by a noticeable drop in similarity.
Of course, for larger k, the scores converge to one.
Reaffirming our earlier observations with the CKA metric,
bge-large demonstrates high retrieval similarity with UAE
and mxbai. Similarity to the remaining models is much
lower, with the highest scores for bge-base and bge-small
for larger k. However, especially for small k, there is high
variance in similarity scores, with models from other families,
e.g. Mistral or gte-large, sometimes achieving higher scores
than the bge models. A similar pattern can also be observed
in the second plot, where Jaccard similarity for gte-large
is highest within its family for larger k, but models like
mxbai or bge-base sometimes report higher similarity
for small k. Therefore, the clusters we identified through
our CKA analysis are only truly reflected in these plots for
large values of k. This suggests that in real-world use cases,
where the top-k results are crucial, such representational similarity
measures might not provide the full picture. The plots for
other model families provide nearly identical insights as
those in the second plot in Figure 3, and thus we do not
present them for the sake of brevity.</p>
        <p>For rank similarity, scores peak for small k and then
quickly start to drop until they approach a low, stable score
for larger k, as shown in Figure 2 for gte-large. Once again,
the bge/UAE/mxbai inter-family cluster shows the highest
similarity. In contrast to Jaccard similarity, the clusters that
could be observed for CKA do not always show for rank
similarity. As can be seen in Figure 2, the model with the
highest rank similarity to gte-large is mxbai, rather than
another gte model. Even so, the previously observed
clusters also tend to appear for rank similarity, though they
vary more depending on the models and dataset.
Generally, scores for nearly all models are rather small for larger
k, indicating low rank similarity. For small k, results vary
more and differences between individual models are more
pronounced.</p>
        <p>As retrieval similarity at small k is of most interest from a
practical perspective, we take a closer look at top-10 Jaccard
similarity. The heatmaps in Figures 4-6 show the top-10
Jaccard similarity between models across datasets. A striking
insight here is that even the most similar models only report
a Jaccard similarity of just above 0.6, with the majority
below 0.5. This is of great relevance to practitioners, as
it implies that even models that report high representational
similarity scores may yield little overlap in retrieved text
chunks. As earlier, the cluster of UAE/mxbai/bge-large is
the most prominent one, with the highest scores.
Intra-family scores tend to be the highest for most models,
e.g. for the t5 and OpenAI families. Depending on the
dataset, this also applies to the gte and e5 models, although
Jaccard similarity to models from other families is
sometimes higher. We also note that for the two larger datasets,
FiQA-2018 and TREC-COVID, the similarity scores are
generally substantially lower, as can be seen in Figure 6. For
the smaller datasets, Jaccard similarity is generally higher,
though results differ depending on the data (see Figures 4
and 5). Similar observations can be made for rank similarity,
although the appearance of family clusters is more
dependent on the dataset. Larger datasets also lead to lower scores.
These results illustrate that while family clusters can still
be perceived at small k, they are not as prominent as they
are for larger k. Furthermore, the top-10 retrieved results
differ substantially for most models and datasets, and their
similarity might depend on the dataset itself.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Open Source Alternatives to Proprietary Models</title>
        <p>We explicitly included proprietary models in our analysis to
check which open source models are the best, which in our
case means the most similar, alternatives. The CKA scores
in Figure 1 indicate that embeddings generated by OpenAI's
models (text-embedding-3-large/-small) are highly similar to
those generated by Mistral, while the Cohere model
(embed-english-v3.0) demonstrates high similarity to e5-large-v2.</p>
        <p>These observations do not entirely extend to retrieval
similarity, especially for Cohere. While Mistral is still the most
similar model to OpenAI's for larger k across all datasets,
there is no consistently most similar model for Cohere.
Rather, the model varies depending on the dataset and
measure (Jaccard or rank similarity), sometimes showing the
highest similarity to e5-large-v2, but sometimes also to other
models like Mistral. Taking a closer look at top-10
similarity, Mistral still largely exhibits the highest similarity to the
OpenAI models, especially to text-embedding-3-large. For
text-embedding-3-small, scores on all datasets are rather
close to those of other models. In absolute terms, however,
retrieval similarity between Mistral and the OpenAI models
is only low to moderate. On smaller datasets, the highest
Jaccard similarity to text-embedding-3-large only reaches
about 0.6 (see Figure 5), while on TREC-COVID, the largest
dataset, Jaccard similarity goes down to merely 0.18 (see
Figure 6). For Cohere's model, the most similar model for
top-10 Jaccard similarity is different for each dataset, with
the highest score of 0.51 occurring on ArguAna, as shown in
Figure 5. For all proprietary models, even the best retrieval
similarity at top-10 still suggests that the embeddings that
would be presented to an LLM can differ notably. Once
again, we could also observe dataset-dependent variance in
scores, with lower retrieval similarity on larger datasets.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>While a pair-wise comparison of embeddings using CKA
shows intra- and inter-family model clusters, retrieval
similarity over different k offers a more nuanced picture.
Especially for small k, which is of most interest from a practical
perspective, retrieval similarity varies. When comparing
the top-10 retrieved text chunks, the low Jaccard similarity
scores indicate little overlap in retrieved chunks, even when
CKA scores are high. Especially for the two larger datasets,
FiQA-2018 and TREC-COVID, these scores are extremely
low. As RAG systems usually operate on millions of
embeddings, our datasets are comparatively small. Therefore,
should a general trend of larger datasets leading to lower
retrieval similarity exist, text chunks retrieved by
different models in a regular use case might be nearly disjoint
for small k. Overall, our results suggest that even though
embeddings seem rather similar when compared directly,
retrieval results can still vary substantially, are most
unstable for k values that are commonly used in RAG
systems, and are also dataset-dependent. Retrieved chunks at small
k show the least overlap, leading to large differences in the data
that would be presented to an LLM as additional context.</p>
      <p>Our analysis demonstrates that although models tend
to be most similar to models from their own family,
inter-family clusters exist. The most prominent of these clusters
is formed by the models bge-large-en-v1.5, UAE-Large-V1
and mxbai-embed-large-v1, which demonstrate high
similarity even for retrieval at low k. Nevertheless, the high
variance of retrieval similarity of the remaining clusters
for small k means that while the identified clusters might
provide some measure of orientation when choosing an
embedding model, the choice still remains a non-trivial task.
Identifying suitable alternatives to proprietary models is
likewise not as simple. While we were able to determine
SFR-Embedding-Mistral as the model most similar to
OpenAI's embedding models, Jaccard similarity at top-10
for larger datasets shows a low overlap in retrieved text
chunks. Furthermore, for Cohere's embedding model, we
were unable to find a single most similar model, as this
varied across datasets for small k values.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper we evaluated the similarity of embedding
models on different datasets. Given the large number of available
models, identifying clusters or families of models with
similar embeddings can simplify the model selection process.
While previous work on LLM similarity exists, to the best
of the authors' knowledge, it so far lacks a clear focus on
embedding models specifically in the context of RAG. We
therefore analyzed the similarity of embeddings generated
by 19 different models using CKA for pairwise comparison,
as well as Jaccard and rank similarity to compare retrieval
behavior at top-k, across five datasets. Comparing
embeddings with CKA generally showed intra- and inter-family
clusters across datasets. These clusters also appeared when
evaluating top-k retrieval similarity with large k values.
However, scores for low k values, which would commonly
be chosen in RAG systems, show high variance and much
lower similarity, especially on larger datasets. Although we
were able to identify some model clusters, our results
suggest that choosing the optimal model remains a non-trivial
task that requires careful consideration.</p>
      <p>Still, we argue that a better understanding of how
similarly different embedding models behave is an important
research topic that requires further attention. There are a
plethora of real-world scenarios where RAG systems can
potentially be deployed. One such scenario, for example,
is to retrieve relevant web content in response to a query.
As large corpora of such data are available in the form of
Web ARChive (WARC) files, evaluating embedding model
similarity on such large, uncleaned datasets would perhaps
lead to a better estimation of model similarity for a realistic
RAG use case. Additionally, as documents often need to
be chunked into smaller parts to fit into the models, the
effect of chunking strategies such as token-based or
semantic chunking on embedding similarity could be explored.
Furthermore, our evaluation focused on a small sample of
similarity measures, each with its own definition of which
criteria make models similar. Introducing more measures
with different perspectives could improve our
understanding of which factors influence model similarity. Finally,
our focus was on identifying clusters or families of models,
which for our initial experiments led us to choose families
of embedding models with varying performance on MTEB.
With the frequent appearance of new models on the
leaderboard and the focus on good MTEB performance, it would
be of interest to compare the best performing models on
MTEB and check if their relative difference in performance
correlates with how similar these models are.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has received funding from the European Union’s
Horizon Europe research and innovation program under
grant agreement No 101070014 (OpenWebSearch.EU, https:
//doi.org/10.3030/101070014).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Mousavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alghisi</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Is your llm outdated? benchmarking llms &amp; alignment algorithms for time-sensitive knowledge</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2404</volume>
          .
          <fpage>08700</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          , Mteb: Massive text embedding benchmark,
          <year>2023</year>
          . arXiv:
          <volume>2210</volume>
          .
          <fpage>07316</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klabunde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schumacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strohmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lemmerich</surname>
          </string-name>
          ,
          <article-title>Similarity of neural network models: A survey of functional and representational measures</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06329</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klabunde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Amor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lemmerich</surname>
          </string-name>
          ,
          <article-title>Towards measuring representational similarity of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.02730</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sohl-Dickstein</surname>
          </string-name>
          ,
          <article-title>Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Morcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Insights on representational similarity in neural networks with canonical correlation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Hardoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Szedmak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          ,
          <article-title>Canonical correlation analysis: An overview with application to learning methods</article-title>
          ,
          <source>Neural computation 16</source>
          (
          <year>2004</year>
          )
          <fpage>2639</fpage>
          -
          <lpage>2664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Denain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Grounding representation similarity through statistical testing</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>1556</fpage>
          -
          <lpage>1568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zullich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pellegrino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Medvet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ansuini</surname>
          </string-name>
          , et al.,
          <article-title>On the similarity between hidden layers of pruned and unpruned convolutional neural networks</article-title>
          ,
          <source>in: Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods</source>
          , Scitepress,
          <year>2020</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>Inner product-based neural network similarity</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <article-title>Convergent learning: Do different neural networks learn the same representations</article-title>
          ?,
          <year>2016</year>
          . arXiv:
          <volume>1511</volume>
          .
          <fpage>07543</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lipson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hopcroft</surname>
          </string-name>
          ,
          <article-title>Convergent learning: Do different neural networks learn the same representations?</article-title>
          , in: D.
          <string-name>
            <surname>Storcheus</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rostamizadeh</surname>
          </string-name>
          , S. Kumar (Eds.),
          <source>Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS</source>
          <year>2015</year>
          , volume
          <volume>44</volume>
          <source>of Proceedings of Machine Learning Research</source>
          , PMLR, Montreal, Canada,
          <year>2015</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>212</lpage>
          . URL: https://proceedings.mlr.press/v44/li15convergent.html.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Similarity of neural network representations revisited</article-title>
          , in: K. Chaudhuri, R. Salakhutdinov (Eds.),
          <source>Proceedings of the 36th International Conference on Machine Learning</source>
          , volume
          <volume>97</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3519</fpage>
          -
          <lpage>3529</lpage>
          . URL: https://proceedings.mlr.press/v97/kornblith19a.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakkiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Barak</surname>
          </string-name>
          ,
          <article-title>Revisiting model stitching to compare neural representations</article-title>
          , in: M.
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Dauphin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          ,
          Curran Associates, Inc.,
          <year>2021</year>
          , pp.
          <fpage>225</fpage>
          -
          <lpage>236</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/01ded4259d101feb739b06c399e9cd9c-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lenc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <article-title>Understanding image representations by measuring their equivariance and equivalence</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>991</fpage>
          -
          <lpage>999</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Balogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jelasity</surname>
          </string-name>
          ,
          <article-title>On the functional similarity of robust and non-robust neural representations</article-title>
          , in: A.
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Brunskill</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Engelhardt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sabato</surname>
          </string-name>
          , J. Scarlett (Eds.),
          <source>Proceedings of the 40th International Conference on Machine Learning</source>
          , volume
          <volume>202</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1614</fpage>
          -
          <lpage>1635</lpage>
          . URL: https://proceedings.mlr.press/v202/balogh23a.html.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Milani Fard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cormier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Canini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Launch and iterate: Reducing prediction churn</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , L. Ma,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>DiffChaser: Detecting disagreements for deep neural networks</article-title>
          ,
          <source>International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Y. Liu,
          <article-title>ModelDiff: testing-based DNN similarity comparison for model reuse detection</article-title>
          ,
          <source>in: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis</source>
          ,
          <source>ISSTA '21</source>
          ,
          ACM
          ,
          <year>2021</year>
          . URL: http://dx.doi.org/10.1145/3460319.3464816. doi:10.1145/3460319.3464816.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Durrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Similarity analysis of contextual word representation models</article-title>
          ,
          <year>2020</year>
          . arXiv:2005.01172.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freestone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K. K.</given-names>
            <surname>Santu</surname>
          </string-name>
          ,
          <article-title>Word embeddings revisited: Do LLMs offer something new?</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.11094.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Konz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kvinge</surname>
          </string-name>
          ,
          <article-title>Understanding the inner workings of language models through representation dissimilarity</article-title>
          ,
          <year>2023</year>
          . arXiv:2310.14993.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <year>2021</year>
          . arXiv:2104.08663.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Finardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Avila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Castaldoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gengo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Larcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Caridá</surname>
          </string-name>
          ,
          <article-title>The chronicles of rag: The retriever, the chunk and the generator</article-title>
          ,
          <year>2024</year>
          . arXiv:2401.07883.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sawarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mangal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Solanki</surname>
          </string-name>
          ,
          <article-title>Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers</article-title>
          ,
          <year>2024</year>
          . arXiv:2404.07220.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Es</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <article-title>Ragas: Automated evaluation of retrieval augmented generation</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.15217.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <article-title>Measuring statistical dependence with hilbert-schmidt norms</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. U.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tomita</surname>
          </string-name>
          (Eds.),
          <source>Algorithmic Learning Theory</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2005</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <article-title>Towards understanding the instability of network embedding</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>927</fpage>
          -
          <lpage>941</lpage>
          . doi:10.1109/TKDE.2020.2989512.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Chroma Inc.</surname>
          </string-name>
          ,
          <article-title>Chroma</article-title>
          ,
          <source>Chroma Homepage</source>
          ,
          <year>2024</year>
          . URL: https://docs.trychroma.com/.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Text embeddings by weakly-supervised contrastive pre-training</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03533</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models</article-title>
          ,
          <year>2021</year>
          . arXiv:2108.08877.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Large dual encoders are generalizable retrievers</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.07899.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <article-title>C-Pack: Packaged resources to advance general Chinese embedding</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.07597.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Towards general text embeddings with multi-stage contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2308.03281</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>New embedding models with lower pricing</article-title>
          ,
          <source>OpenAI Blog</source>
          ,
          <year>2024</year>
          . URL: https://openai.com/blog/new-embedding-models-and-api-updates.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Cohere</surname>
          </string-name>
          ,
          <article-title>Embeddings - text embeddings with advanced language models</article-title>
          ,
          <source>Cohere Homepage</source>
          ,
          <year>2024</year>
          . URL: https://cohere.com/embeddings.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shakir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koenig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lipp</surname>
          </string-name>
          ,
          <article-title>Open source strikes bread - new fluffy embeddings model</article-title>
          ,
          <year>2024</year>
          . URL: https://www.mixedbread.ai/blog/mxbai-embed-large-v1.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>AnglE-optimized text embeddings</article-title>
          ,
          <source>arXiv preprint arXiv:2309.12871</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yavuz</surname>
          </string-name>
          ,
          <article-title>SFR-Embedding-Mistral: Enhance text retrieval with transfer learning</article-title>
          ,
          <source>Salesforce AI Research Blog</source>
          ,
          <year>2024</year>
          . URL: https://blog.salesforceairesearch.com/sfr-embedded-mistral/.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Waskom</surname>
          </string-name>
          ,
          <article-title>seaborn: statistical data visualization</article-title>
          ,
          <source>Journal of Open Source Software</source>
          <volume>6</volume>
          (
          <year>2021</year>
          )
          <fpage>3021</fpage>
          . URL: https://doi.org/10.21105/joss.03021. doi:10.21105/joss.03021.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>