<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Laura</forename><surname>Caspari</surname></persName>
							<email>laura.caspari@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kanishka</forename><surname>Ghosh Dastidar</surname></persName>
							<email>kanishka.ghoshdastidar@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Saber</forename><surname>Zerhoudi</surname></persName>
							<email>saber.zerhoudi@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jelena</forename><surname>Mitrovic</surname></persName>
							<email>jelena.mitrovic@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michael</forename><surname>Granitzer</surname></persName>
							<email>michael.granitzer@uni-passau.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Passau</orgName>
								<address>
									<settlement>Passau</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="laboratory">ACM SIGIR Workshop on Information Retrieval&apos;s Role in RAG Systems</orgName>
								<address>
									<addrLine>July 18, 2024</addrLine>
									<region>Washington, D.C.</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BD400077D5A4E5A3AE41226FC76BA8C1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large language model</term>
					<term>Retrieval-augmented generation</term>
					<term>Model similarity</term>
					<!-- ORCID iDs: 0009-0002-6670-3211 (L. Caspari), 0000-0003-4171-0597 (K. Ghosh Dastidar), 0000-0003-2259-0462 (S. Zerhoudi), 0000-0003-3220-8749 (J. Mitrovic), 0000-0003-3566-5507 (M. Granitzer) -->
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The choice of embedding model is a crucial step in the design of Retrieval Augmented Generation (RAG) systems. Given the sheer volume of available options, identifying clusters of similar models streamlines this model selection process. Relying solely on benchmark performance scores only allows for a weak assessment of model similarity. Thus, in this study, we evaluate the similarity of embedding models within the context of RAG systems. Our assessment is two-fold: We use Centered Kernel Alignment to compare embeddings on a pair-wise level. Additionally, as it is especially pertinent to RAG systems, we evaluate the similarity of retrieval results between these models using Jaccard and rank similarity. We compare different families of embedding models, including proprietary ones, across five datasets from the popular BEIR benchmark. Through our experiments we identify clusters of models corresponding to model families, but interestingly, also some inter-family clusters. Furthermore, our analysis of top-𝑘 retrieval similarity reveals high variance at low 𝑘 values. We also identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Motivation</head><p>Retrieval-Augmented Generation (RAG) is an emerging paradigm that helps mitigate the problems of factual hallucination <ref type="bibr" target="#b0">[1]</ref> and outdated training data <ref type="bibr" target="#b1">[2]</ref> of large language models (LLMs) by providing these models with access to an external, non-parametric knowledge source (e.g. a document corpus). Central to the functioning of RAG frameworks is the retrieval step, wherein a small subset of candidate documents, specific to the input query or prompt, is retrieved from the document corpus. This retrieval process, known as dense retrieval, hinges on text embeddings. Typically, the generation of these embeddings is assigned to an LLM, for which there are several options due to the rapid evolution of the field. Consequently, selecting the most suitable embedding model from an array of available choices emerges as a critical aspect in the development of RAG systems. The information to guide this choice is currently limited primarily to architectural details (which are also on occasion scarce due to the prevalence of closed models) and performance benchmarks such as the Massive Text Embedding Benchmark (MTEB) <ref type="bibr" target="#b2">[3]</ref>.</p><p>We posit that an analysis of the similarity of the embeddings generated by these models would significantly aid this model selection process. Given the large number of candidates and the ever increasing scale of the models, a from-scratch empirical evaluation of the embedding quality of these LLMs on a particular task can incur significant costs. This challenge becomes especially pronounced when dealing with large-scale corpora comprising potentially millions of documents. 
While the relative performance scores of these models on benchmark datasets offer the simplified perspective of comparing a single scalar value on an array of downstream tasks, such a view of model similarity might overlook the nuances of the relative behaviour of the models <ref type="bibr" target="#b3">[4]</ref>. As an example, the absolute difference in precision@k between two retrieval systems only provides a weak indication of the overlap of their retrieved results. We argue that identifying clusters of models with similar behaviour would allow practitioners to construct smaller, yet diverse candidate pools of models to evaluate. Beyond model selection, as highlighted by Klabunde et al. <ref type="bibr" target="#b4">[5]</ref>, such an analysis also facilitates the identification of common factors contributing to strong performance, easier model ensembling, and detection of potential instances of unauthorized model reuse.</p><p>In this paper, we analyze different LLMs in terms of the similarities of the embeddings they generate. Our similarity analysis serves as an unsupervised evaluation framework for these embedding models, in contrast to performance benchmarks that require labelled data. We do this from a dual perspective: we directly compare the embeddings using representational similarity measures. Additionally, we evaluate model similarity specifically in terms of their functional impact on RAG systems, i.e. we look at how similar the retrieved results are. Our evaluation focuses on several prominent model families, to analyze similarities both within and across them. We also compare proprietary models (such as those by OpenAI or <ref type="bibr">Cohere</ref>) to open-source ones in order to identify the most similar alternatives. Our experiments are carried out on five popular benchmark datasets to determine if similarities between models are influenced by the choice of data. 
Our code is available at https://github.com/casparil/embedding-model-similarity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>The datasets used for generating embeddings with their number of queries and corpus size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Canonical Correlation Analysis (CCA) is a statistical technique used to find the linear relationship between two sets of variables by maximizing their correlation. Such comparisons using CCA or variants thereof can be found in several works <ref type="bibr" target="#b8">[9]</ref>, <ref type="bibr" target="#b9">[10]</ref>, <ref type="bibr" target="#b10">[11]</ref>. Beyond CCA-based measures, other works have also explored computing correlations <ref type="bibr" target="#b11">[12]</ref> and the mutual information <ref type="bibr" target="#b12">[13]</ref> between neurons across networks. Kornblith et al. <ref type="bibr" target="#b13">[14]</ref> propose Centered Kernel Alignment (CKA), which they show improves over several similarity measures in identifying corresponding layers of identical networks with different initializations. A diverse range of functional similarity evaluations have also been explored in the literature. A few examples include model stitching <ref type="bibr" target="#b14">[15]</ref>, <ref type="bibr" target="#b15">[16]</ref>, <ref type="bibr" target="#b16">[17]</ref>, disagreement measures between output classes <ref type="bibr" target="#b17">[18]</ref>, <ref type="bibr" target="#b18">[19]</ref>, and quantifying the similarity between the class-wise output probabilities <ref type="bibr" target="#b19">[20]</ref>. We point the reader to the survey by Klabunde et al. <ref type="bibr" target="#b3">[4]</ref> for a detailed overview of representational and functional similarity measures.</p><p>Recently, a few works have also focused on specifically evaluating the similarity of LLMs. While Wu et al. <ref type="bibr" target="#b20">[21]</ref> evaluate language models along several perspectives, such as their representational and neuron-level similarities, their evaluation pre-dates the introduction of the recent wave of large-scale models. 
Freestone and Santu <ref type="bibr" target="#b21">[22]</ref> consider similarities of word embeddings, and evaluate whether LLMs differ significantly from classical encoding models in terms of their representations. The works by Klabunde et al. <ref type="bibr" target="#b4">[5]</ref> and Brown et al. <ref type="bibr" target="#b22">[23]</ref> are more recent and evaluate the representational similarity of LLMs, with the latter also considering the similarities between models of different sizes in the same model family.</p><p>Much of the literature on the evaluation of LLM embeddings focuses on their performance on downstream tasks, with benchmarks such as BEIR <ref type="bibr" target="#b23">[24]</ref> (for retrieval specifically) and MTEB <ref type="bibr" target="#b2">[3]</ref> providing a unified view of embedding quality across metrics and datasets. The metrics used here mostly include typical information retrieval metrics such as precision, recall, and mean reciprocal rank at certain cutoffs. Some works specifically evaluate the retrieval components in a RAG context, where they either use a dataset outside of those included in the benchmarks <ref type="bibr" target="#b24">[25]</ref> or where the evaluation encompasses other aspects of the retriever beyond the embedding model being used <ref type="bibr" target="#b25">[26]</ref>. Another approach that does not rely on ground-truth labels is given by the Retrieval Augmented Generation Assessment (RAGAS) framework, which uses an LLM to determine the ratio of sentences in the retrieved context that are relevant to the answer being generated <ref type="bibr" target="#b26">[27]</ref>. To the best of our knowledge, there are no works that evaluate the similarity of embedding models from a retrieval perspective.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>We evaluate embedding model similarity using two approaches. The first directly compares the embeddings of text chunks generated by the models. The second approach is specific to the RAG context, where we evaluate the similarity of retrieved results for a given query. These approaches are discussed in detail in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Pair-wise Embedding Similarity</head><p>There are several metrics defined in the literature that measure representational similarity <ref type="bibr" target="#b3">[4]</ref>. Many of these metrics require the representation spaces of the models being compared to be aligned and/or the dimensionality of the embeddings to be identical across models. To avoid these constraints, we pick Centered Kernel Alignment (CKA) <ref type="bibr" target="#b13">[14]</ref> with a linear kernel as our similarity measure.</p><p>The measure computes the similarity between two sets of embeddings in two steps. First, for a set of embeddings, the pair-wise similarity scores between all entries within this set are computed using the kernel function. Thus, row 𝑘 of the resulting similarity matrix contains entries representing the similarity between embedding 𝑘 and all other embeddings, including itself. Computing two such embedding similarity matrices for different models with the same number of embeddings then leads to two matrices 𝐸 and 𝐸′ of matching dimensions. These are compared directly in the second step with the Hilbert-Schmidt Independence Criterion (HSIC) <ref type="bibr" target="#b27">[28]</ref> using the following formula:</p><formula xml:id="formula_0">𝐶𝐾𝐴(𝐸, 𝐸′) = 𝐻𝑆𝐼𝐶(𝐸, 𝐸′) / √(𝐻𝑆𝐼𝐶(𝐸, 𝐸) · 𝐻𝑆𝐼𝐶(𝐸′, 𝐸′))</formula><p>The resulting similarity scores are bounded in the interval [0, 1], with a score of 1 indicating equivalent representations. CKA assumes that representations are mean-centered.</p></div>
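With a linear kernel, the two-step computation above collapses into a closed form: HSIC reduces to squared Frobenius norms of the (cross-)covariance matrices of the mean-centered embeddings. The following is a minimal sketch assuming NumPy; the function name `linear_cka` is ours and not taken from the paper's released code:

```python
import numpy as np

def linear_cka(E: np.ndarray, E_prime: np.ndarray) -> float:
    """Linear CKA between two embedding matrices of shape (n_chunks, dim).

    The two models may use different embedding dimensions; only the
    number of embeddings (rows) must match, with row i of both matrices
    encoding the same text chunk.
    """
    # CKA assumes mean-centered representations.
    X = E - E.mean(axis=0)
    Y = E_prime - E_prime.mean(axis=0)
    # With a linear kernel, HSIC reduces to a squared Frobenius norm.
    hsic_xy = np.linalg.norm(X.T @ Y, ord="fro") ** 2
    hsic_xx = np.linalg.norm(X.T @ X, ord="fro") ** 2
    hsic_yy = np.linalg.norm(Y.T @ Y, ord="fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```

Comparing a model's embeddings with themselves yields a score of exactly 1, while scores between unrelated embeddings fall toward 0.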
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Retrieval Similarity</head><p>While a pair-wise comparison of embeddings offers insights into the similarities of the representations learned by these models, it does not suffice to quantify the similarities in outcomes when these embedding models are deployed for specific tasks. Therefore, in the context of RAG systems, we consider the similarity of the retrieved text chunks for a given query when different embedding models are used. As a first step, for a given dataset, we generate embeddings of queries and document chunks with each of the embedding models. We then retrieve the 𝑘 most similar embeddings in terms of cosine similarity for a particular query. As these embeddings correspond to specific chunks of text, we derive the sets of retrieved chunks 𝐶 and 𝐶′ for a pair of models. To measure the similarity of these sets, we use the Jaccard similarity coefficient as follows:</p><formula xml:id="formula_1">𝐽𝑎𝑐𝑐𝑎𝑟𝑑(𝐶, 𝐶′) = |𝐶 ∩ 𝐶′| / |𝐶 ∪ 𝐶′|</formula><p>Here, the intersection |𝐶 ∩ 𝐶′| counts the chunks retrieved by both models, while the union |𝐶 ∪ 𝐶′| corresponds to all retrieved text chunks, counting chunks present in both sets only once. The resulting score is bounded in the interval [0, 1], with 1 indicating that both models retrieved the same set of text chunks.</p><p>While Jaccard similarity computes the degree to which two sets overlap, it ignores the order of their elements. Rank similarity <ref type="bibr" target="#b28">[29]</ref>, on the other hand, considers the order of common elements, with closer elements having a higher impact on the score. The measure assigns ranks to common text chunks according to their similarity to the query, i.e. 𝑟𝐶(𝑗) = 𝑛 if chunk 𝑗 was the top-𝑛 retrieved result for the query. 
Ranks are then compared using:</p><formula xml:id="formula_2">𝑅𝑎𝑛𝑘(𝑟𝐶(𝑗), 𝑟𝐶′(𝑗)) = 2 / ((1 + |𝑟𝐶(𝑗) − 𝑟𝐶′(𝑗)|) · (𝑟𝐶(𝑗) + 𝑟𝐶′(𝑗)))</formula><p>With this, the rank similarity for two sets of retrieved text chunks 𝐶, 𝐶′ is calculated as:</p><formula xml:id="formula_3">𝑅𝑎𝑛𝑘𝑆𝑖𝑚(𝐶, 𝐶′) = (1 / 𝐻(|𝐶 ∩ 𝐶′|)) ∑_{𝑗 ∈ 𝐶 ∩ 𝐶′} 𝑅𝑎𝑛𝑘(𝑟𝐶(𝑗), 𝑟𝐶′(𝑗)), with 𝐻(𝐾) = ∑_{𝑘=1}^{𝐾} 1/𝑘</formula><p>Here, 𝐻(𝐾) denotes the 𝐾-th harmonic number, which normalizes the score. Like the other measures, rank similarity is bounded in the interval [0, 1], with 1 indicating that all ranks are identical.</p></div>
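Both retrieval measures are straightforward to compute from two ranked result lists. The sketch below uses our own helper names (not the paper's code); chunks are identified by their ids, and rank 1 denotes the top retrieved result:

```python
def jaccard_similarity(C: set, C_prime: set) -> float:
    """Jaccard coefficient between two sets of retrieved chunk ids."""
    if not C and not C_prime:
        return 1.0
    return len(C & C_prime) / len(C | C_prime)

def rank_similarity(ranks_c: dict, ranks_c_prime: dict) -> float:
    """Rank similarity between two retrievals.

    Each argument maps chunk id -> rank (1 = most similar to the query).
    Only chunks retrieved by both models contribute, closer ranks weigh
    more, and the harmonic number of the overlap size normalizes the sum.
    """
    common = ranks_c.keys() & ranks_c_prime.keys()
    if not common:
        return 0.0
    total = sum(
        2 / ((1 + abs(ranks_c[j] - ranks_c_prime[j]))
             * (ranks_c[j] + ranks_c_prime[j]))
        for j in common
    )
    harmonic = sum(1 / k for k in range(1, len(common) + 1))
    return total / harmonic
```

Two models that retrieve the same chunks in the same order score 1 under both measures; identical sets in a different order still score 1 under Jaccard similarity but below 1 under rank similarity.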
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head><p>The following paragraphs describe our choice of datasets and models, along with details of the implementation of our experiments.</p><p>As we focus on the retrieval component of RAG systems, we select five publicly available datasets from the BEIR benchmark <ref type="bibr" target="#b23">[24]</ref>. As generating embeddings for large datasets is a time-intensive process, especially for a larger number of models, we opt for five of the smaller datasets from the benchmark. This allows us to compare embeddings generated by a variety of models while also evaluating embedding similarity across datasets. An overview of the datasets is shown in Table <ref type="table">1</ref>. For each dataset, we create embeddings by splitting documents into text chunks such that each chunk contains 256 tokens. The embedding vectors are stored with Chroma DB <ref type="bibr" target="#b29">[30]</ref>, an open-source embedding database. For each vector, we additionally store the document and text chunk ids it encodes so that embeddings generated by different models can be matched for evaluation.</p><p>For model selection, we primarily use publicly available models from the MTEB leaderboard <ref type="bibr" target="#b2">[3]</ref>. We do not simply pick the best-performing models on the leaderboard; instead, our choices are influenced by several factors. Firstly, we focus on analyzing similarities within and across model families and pick models belonging to the e5 <ref type="bibr" target="#b30">[31]</ref>, t5 <ref type="bibr" target="#b31">[32,</ref><ref type="bibr" target="#b32">33]</ref>, bge <ref type="bibr" target="#b33">[34]</ref>, and gte <ref type="bibr" target="#b34">[35]</ref> families. Secondly, we recognize that users may wish to avoid the pay-by-token policies of proprietary models by identifying similar open-source alternatives. 
Therefore, we pick high-performing proprietary models, two from OpenAI (text-embedding-3-large and -small) <ref type="bibr" target="#b35">[36]</ref> and one from Cohere (Cohere embed-english-v3.0) <ref type="bibr" target="#b36">[37]</ref>. We also compare the mxbai-embed-large-v1 (mxbai) <ref type="bibr" target="#b37">[38]</ref> and UAE-Large-V1 (UAE) <ref type="bibr" target="#b38">[39]</ref> models, which not only report very similar performance on MTEB, but also identical embedding dimensions, model size, and memory usage. Finally, we include SFR-Embedding-Mistral (Mistral) <ref type="bibr" target="#b39">[40]</ref> as the best-performing model on the leaderboard at the time of our experiments. A detailed overview of all selected models can be seen in Table <ref type="table" target="#tab_1">2</ref>.</p><p>To compare embedding similarity across models and datasets, we employ different strategies depending on the similarity measure. We apply CKA by retrieving all embeddings created by a model, matching embeddings using their document and text chunk ids, and then computing their similarity for each of the five datasets. For Jaccard and rank similarity, we use sklearn's NearestNeighbors class <ref type="bibr" target="#b40">[41]</ref> to determine the top-𝑘 retrieval results. We compute Jaccard and rank scores per dataset, averaging over 25 queries. For the NFCorpus dataset, we calculate retrieval similarity for all possible 𝑘, i.e. using all embeddings generated for the dataset. As calculating similarity for each possible 𝑘 is computationally expensive, we did not repeat this for the remaining datasets and chose a smaller 𝑘 value instead. Furthermore, as only a limited number of results are provided as context to the generative model, analyzing retrieval similarity at low 𝑘 values, e.g. the top 10, is of most interest. 
As we are interested in identifying clusters of similar models, we also perform a hierarchical clustering on heatmap values using Seaborn <ref type="bibr" target="#b41">[42]</ref>. The following section describes the results of our evaluation for the different measures.</p></div>
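The per-query retrieval step described above can be sketched as follows, assuming the embeddings have been exported from the vector store into NumPy arrays; the helper `top_k_chunks` and its arguments are illustrative, not the paper's actual code. sklearn's NearestNeighbors with the cosine metric returns neighbours in order of increasing distance, i.e. decreasing similarity, which directly yields the ranks needed for the rank similarity measure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def top_k_chunks(query_emb, chunk_embs, chunk_ids, k=10):
    """Map chunk id -> retrieval rank (1 = most similar) for one query."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine")
    nn.fit(chunk_embs)
    # kneighbors returns neighbours sorted by increasing cosine distance.
    _, idx = nn.kneighbors(query_emb.reshape(1, -1))
    return {chunk_ids[i]: rank for rank, i in enumerate(idx[0], start=1)}
```

Running this for the same query against two models' embeddings gives the chunk-id sets (the dict keys) for Jaccard similarity and the rank maps for rank similarity.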
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>To evaluate how similar the embeddings generated by different models are, we first consider model families, checking whether their pair-wise and top-𝑘 similarity scores are highest within their family. Subsequently, we identify the open-source models which are most similar to our chosen proprietary models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Intra-and Inter-Family Clusters</head><p>Comparing embeddings directly with CKA shows high similarity across most of the models, albeit with some variance. These scores allow us to identify certain clusters of models.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the pair-wise CKA scores of all models averaged across the five datasets. As expected, scores for most models are highest within their own family. This holds true for the gtr-t5, sentence-t5 and text-embedding-3 (OpenAI) models. Although the sentence-t5 and gtr-t5 models are closely related, they do not exhibit significantly higher similarity with each other compared to the remaining models.</p><p>From an inter-family perspective, we observe high similarity between the bge and gte models. For some models in these two families, interestingly, the highest similarity scores correspond to inter-family counterparts with matching embedding dimensions rather than to models in the same family. Specifically, gte-small reports the highest similarity to bge-small and gte-base to bge-base. On the other hand, gte-large shows slightly higher similarity to bge-base than to bge-large, and thus to a model with a lower embedding dimension. Another inter-family cluster is formed by the three models with the highest CKA scores overall, namely UAE, mxbai and bge-large, whose scores suggest almost perfect embedding similarity. In fact, the similarity score of bge-large to these two models is much higher than to other bge models.</p><p>Shifting our attention to top-𝑘 retrieval similarity, clusters vary depending on the 𝑘 value. Figure <ref type="figure" target="#fig_4">3</ref> illustrates how Jaccard similarity evolves over 𝑘 on NFCorpus. The first plot displays Jaccard scores between bge-large and all other models, while the second plot illustrates the scores for gte-large. 
For extremely low 𝑘, we observe some peaks for nearly all models, followed by a noticeable drop in similarity. Naturally, for larger 𝑘, the scores converge to one. Reaffirming our earlier observations with the CKA metric, bge-large demonstrates high retrieval similarity with UAE and mxbai. Similarity to the remaining models is much lower, with the highest scores for bge-base and bge-small at larger 𝑘. However, especially for small 𝑘, there is high variance in the similarity scores, with models from other families, e.g. Mistral or gte-large, sometimes achieving higher scores than the bge models. A similar pattern can be observed in the second plot, where Jaccard similarity for gte-large is highest within its family for larger 𝑘, but models like mxbai or bge-base sometimes report higher similarity for small 𝑘. Therefore, the clusters we identified through our CKA analysis are only truly reflected in these plots for large values of 𝑘. This suggests that in real-world use cases, where the top-𝑘 results are crucial, such representational similarity measures might not provide the full picture. The plots for other model families provide nearly identical insights to those in the second plot in Figure <ref type="figure" target="#fig_4">3</ref> and thus we do not present them for the sake of brevity.</p><p>For rank similarity, scores peak for small 𝑘 and then quickly start to drop until they approach a low stable score for larger 𝑘, as shown in Figure <ref type="figure" target="#fig_1">2</ref> for gte-large. Once again, the bge/UAE/mxbai inter-family cluster shows the highest similarity. In contrast to Jaccard similarity, the clusters that could be observed for CKA do not always show for rank similarity. 
As can be seen in Figure <ref type="figure" target="#fig_1">2</ref>, the model with the highest rank similarity to gte-large is mxbai, rather than another gte model. Even so, the previously observed clusters also tend to appear for rank similarity, though they vary more depending on the models and dataset. Generally, scores for nearly all models are rather small for larger 𝑘, indicating low rank similarity. For small 𝑘, results vary more and differences between individual models are more pronounced.</p><p>As retrieval similarity at small 𝑘 is of most interest from a practical perspective, we take a closer look at top-10 Jaccard similarity. The heatmaps in Figures <ref type="figure" target="#fig_11">4-6</ref> show the top-10 Jaccard similarity between models across datasets. A striking insight here is that even the most similar models only reach a Jaccard similarity slightly above 0.6, with the majority of scores below 0.5. This is of great relevance to practitioners, as it implies that even embeddings from models with high representational similarity scores may yield little overlap in retrieved text chunks. As earlier, the cluster of UAE/mxbai/bge-large is the most prominent one with the highest scores. Intra-family scores tend to be the highest for most models, i.e. t5 and OpenAI. 
Depending on the dataset, this also applies to gte and e5 models, although Jaccard similarity to models from other families is sometimes higher. We also note that for the two larger datasets FiQA-2018 and TREC-COVID, the similarity scores are generally substantially lower, as can be seen in Figure <ref type="figure" target="#fig_11">6</ref>. For the smaller datasets, Jaccard similarity is generally higher, though results differ depending on the data (see Figures <ref type="figure" target="#fig_10">4 and 5</ref>). Similar observations can be made for rank similarity, although the appearance of family clusters is more dependent on the dataset. Larger datasets also lead to lower scores. These results illustrate that while family clusters can still be perceived at small 𝑘, they are not as prominent as they are for larger 𝑘. Furthermore, the top-10 retrieved results differ substantially for most models and datasets, and their similarity might depend on the dataset itself.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Open Source Alternatives to Proprietary Models</head><p>We explicitly included proprietary models in our analysis to check which open-source models are the best, i.e. the most similar, alternatives. The CKA scores in Figure <ref type="figure" target="#fig_0">1</ref> indicate that embeddings generated by OpenAI's models (text-embedding-3-large/-small) are highly similar to those generated by Mistral, while the Cohere model (embed-english-v3.0) demonstrates high similarity to e5-large-v2. These observations do not entirely extend to retrieval similarity, especially for Cohere. While Mistral is still the most similar model to OpenAI's for larger 𝑘 across all datasets, there is no consistently most similar model for Cohere. Rather, the most similar model varies depending on the dataset and measure (Jaccard or rank similarity), sometimes being e5-large-v2, but sometimes also other models like Mistral. Taking a closer look at top-10 similarity, Mistral still largely exhibits the highest similarity to the OpenAI models, especially to text-embedding-3-large. For text-embedding-3-small, scores on all datasets are rather close to those of other models. In absolute terms, however, retrieval similarity between Mistral and the OpenAI models is only low to moderate. On smaller datasets, the highest Jaccard similarity to text-embedding-3-large only reaches about 0.6 (see Figure <ref type="figure" target="#fig_6">5</ref>), while on TREC-COVID, the largest dataset, Jaccard similarity goes down to merely 0.18 (see Figure <ref type="figure" target="#fig_11">6</ref>). For Cohere's model, the most similar model in terms of top-10 Jaccard similarity differs for each dataset, with the highest score of 0.51 occurring on ArguAna, as shown in Figure <ref type="figure" target="#fig_6">5</ref>. 
For all proprietary models, even the best retrieval similarity at top-10 still suggests that the text chunks that would be presented to an LLM as context can differ notably. Once again, we could also observe dataset-dependent variance in scores, with lower retrieval similarity on larger datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>While a pair-wise comparison of embeddings using CKA shows intra-and inter-family model clusters, retrieval similarity over different 𝑘 offers a more nuanced picture. Especially for small 𝑘, which is of most interest from a practical perspective, retrieval similarity varies. When comparing the top-10 retrieved text chunks, the low Jaccard similarity scores indicate little overlap in retrieved chunks, even when CKA scores are high. Especially for the two larger datasets FiQA-2018 and TREC-COVID, these scores are extremely low. As RAG systems usually operate on millions of embeddings, our datasets are comparatively small. Therefore, should there be a general trend of larger datasets leading to lower retrieval similarity, the text chunks retrieved by different models in a regular use case might be nearly disjoint for small 𝑘. Overall, our results suggest that even though embeddings seem rather similar when compared directly, retrieval behaviour can still vary substantially, is most unstable at the 𝑘 values commonly used in RAG systems, and is also dataset-dependent. Retrieved chunks at small 𝑘 show the least overlap, leading to large differences in the data that would be presented to an LLM as additional context. 
Our analysis demonstrates that although models tend to be most similar to models from their own family, inter-family clusters exist. The most prominent of these clusters is formed by the models bge-large-en-v1.5, UAE-Large-V1 and mxbai-embed-large-v1, which demonstrate high similarity even for retrieval at low 𝑘. Nevertheless, the high variance of retrieval similarity within the remaining clusters for small 𝑘 means that while the identified clusters might provide some orientation when choosing an embedding model, the choice remains a non-trivial task. Identifying suitable alternatives to proprietary models is likewise non-trivial. 
While we were able to identify SFR-Embedding-Mistral as the model most similar to OpenAI's embedding models, Jaccard similarity at top-10 on larger datasets shows a low overlap in retrieved text chunks. Furthermore, for Cohere's embedding model, we were unable to find a single most similar model, as the most similar model varied across datasets for small 𝑘 values.</p></div>
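The linear variant of CKA used for the pair-wise embedding comparisons discussed above can be sketched as follows (a standard formulation of linear CKA; the function and variable names are ours, not the paper's code):

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two embedding matrices of shape (n_texts, dim).

    The two models may use different embedding dimensions; only the
    number of embedded texts n must match. The score lies in [0, 1].
    """
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    return (np.linalg.norm(Y.T @ X, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the embeddings, which is why two models can score near 1 under CKA while still ranking chunks differently at retrieval time.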
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this paper we evaluated the similarity of embedding models on different datasets. Given the large number of available models, identifying clusters or families of models with similar embeddings can simplify the model selection process. While previous work on LLM similarity exists, to the best of the authors' knowledge, it so far lacks a clear focus on embedding models specifically in the context of RAG. We therefore analyzed the similarity of embeddings generated by 19 different models using CKA for pairwise comparison as well as Jaccard and rank similarity to compare retrieval behavior at top-𝑘 across five datasets. Comparing embeddings with CKA generally showed intra-and inter-family clusters across datasets. These clusters also appeared when evaluating top-𝑘 retrieval similarity with large 𝑘 values. However, scores for low 𝑘 values, which would commonly be chosen in RAG systems, show high variance and much lower similarity, especially on larger datasets. Although we were able to identify some model clusters, our results suggest that choosing the optimal model remains a non-trivial task that requires careful consideration.</p><p>Still, we argue that a better understanding of how similarly different embedding models behave is an important research topic that requires further attention. There are a plethora of real-world scenarios where RAG systems can potentially be deployed. One such scenario, for example, is to retrieve relevant web content in response to a query. As large corpora of such data are available in the form of Web ARChive (WARC) files, evaluating embedding model similarity on such large, uncleaned datasets would perhaps lead to a better estimation of model similarity for a realistic RAG use case. 
Additionally, as documents often need to be split into smaller chunks to fit the models' input limits, the effect of chunking strategies such as token-based or semantic chunking on embedding similarity could be explored. Furthermore, our evaluation focused on a small sample of similarity measures, each with its own definition of what makes models similar. Introducing further measures with different perspectives could improve our understanding of which factors influence model similarity. Finally, our focus was on identifying clusters or families of models, which for our initial experiments led us to choose families of embedding models with varying performance on MTEB. Given the frequent appearance of new models on the leaderboard and the focus on strong MTEB performance, it would be of interest to compare the best-performing models on MTEB and examine whether their relative difference in performance correlates with how similar these models are.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Mean CKA similarity across all five datasets. Models tend to be most similar to models belonging to their own family, though some interesting inter-family patterns are visible as well.</figDesc></figure>
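The token-based chunking mentioned as future work above can be illustrated with a minimal sliding-window sketch (the parameter values and function name are illustrative assumptions, not the paper's setup):

```python
def chunk_by_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into fixed-size overlapping chunks,
    so that no chunk exceeds an embedding model's input limit."""
    step = chunk_size - overlap
    # max(..., 1) guarantees at least one chunk for short documents
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]


# A 1000-token document becomes chunks of 512, 512, and 104 tokens.
chunks = chunk_by_tokens(list(range(1000)))
print([len(c) for c in chunks])
```

The overlap keeps sentences that straddle a chunk boundary fully contained in at least one chunk; semantic chunking would instead place boundaries at topically coherent break points.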
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Rank similarity over all 𝑘 on NFCorpus, comparing gte-large to all other models. Scores are highest and vary most for small 𝑘, but then drop quickly before stabilizing for larger 𝑘.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Legend for Figure 3(b): gte-large compared to each of the other models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Jaccard similarity over all 𝑘 on NFCorpus, comparing bge-large (a) and gte-large (b) to all other models. While bge-large shows high similarity to UAE-Large-V1 and mxbai-embed-large-v1, scores for gte-large are clustered much closer. Jaccard similarity seems to be most unstable for small values of 𝑘, which would commonly be chosen for retrieval tasks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4:</head><label>4</label><figDesc>Figure 4: Jaccard (a) and rank similarity (b) for the top-10 retrieved text chunks averaged over 25 queries on NFCorpus. The clusters vary slightly depending on the measure, as do the scores. Models tend to be most similar to models from their own family. However, some inter-family clusters are visible as well.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>5</head><label>5</label><figDesc>Axis labels: the evaluated embedding models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 5:</head><label>5</label><figDesc>Figure 5: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on SciFact (a) and ArguAna (b). The UAE and mxbai models show high levels of similarity along with bge-large. The remaining models tend to show the highest similarity within their own family, with the exception of the bge/gte inter-family cluster.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>5</head><label>5</label><figDesc>Axis labels: the evaluated embedding models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>5</head><label>5</label><figDesc>Axis labels: the evaluated embedding models.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 6:</head><label>6</label><figDesc>Figure 6: Jaccard similarity for the top-10 retrieved text chunks averaged over 25 queries on FiQA-2018 (a) and TREC-COVID (b). Most models seem to retrieve almost completely distinct text chunks. Only the bge/UAE/mxbai cluster still shows a notable level of similarity, while the scores of the remaining clusters indicate only moderate to low levels of similarity.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>We compare a diverse set of open source models from different families as well as proprietary models with varying performance on MTEB.</figDesc><table><row><cell>Model</cell><cell>Embedding dimension</cell><cell>Max. tokens</cell><cell>MTEB average</cell><cell>Open source</cell></row><row><cell>SFR-Embedding-Mistral</cell><cell>4096</cell><cell>32768</cell><cell>67.56</cell><cell>✓</cell></row><row><cell>mxbai-embed-large-v1</cell><cell>1024</cell><cell>512</cell><cell>64.68</cell><cell>✓</cell></row><row><cell>UAE-Large-V1</cell><cell>1024</cell><cell>512</cell><cell>64.64</cell><cell>✓</cell></row><row><cell>text-embedding-3-large</cell><cell>3072</cell><cell>8191</cell><cell>64.59</cell><cell>✗</cell></row><row><cell>Cohere embed-english-v3.0</cell><cell>1024</cell><cell>512</cell><cell>64.47</cell><cell>✗</cell></row><row><cell>bge-large-en-v1.5</cell><cell>1024</cell><cell>512</cell><cell>64.23</cell><cell>✓</cell></row><row><cell>bge-base-en-v1.5</cell><cell>768</cell><cell>512</cell><cell>63.55</cell><cell>✓</cell></row><row><cell>gte-large</cell><cell>1024</cell><cell>512</cell><cell>63.13</cell><cell>✓</cell></row><row><cell>gte-base</cell><cell>768</cell><cell>512</cell><cell>62.39</cell><cell>✓</cell></row><row><cell>text-embedding-3-small</cell><cell>1536</cell><cell>8191</cell><cell>62.26</cell><cell>✗</cell></row><row><cell>e5-large-v2</cell><cell>1024</cell><cell>512</cell><cell>62.25</cell><cell>✓</cell></row><row><cell>bge-small-en-v1.5</cell><cell>384</cell><cell>512</cell><cell>62.17</cell><cell>✓</cell></row><row><cell>e5-base-v2</cell><cell>768</cell><cell>512</cell><cell>61.5</cell><cell>✓</cell></row><row><cell>gte-small</cell><cell>384</cell><cell>512</cell><cell>61.36</cell><cell>✓</cell></row><row><cell>e5-small-v2</cell><cell>384</cell><cell>512</cell><cell>59.93</cell><cell>✓</cell></row><row><cell>gtr-t5-large</cell><cell>768</cell><cell>512</cell><cell>58.28</cell><cell>✓</cell></row><row><cell>sentence-t5-large</cell><cell>768</cell><cell>512</cell><cell>57.06</cell><cell>✓</cell></row><row><cell>gtr-t5-base</cell><cell>768</cell><cell>512</cell><cell>56.19</cell><cell>✓</cell></row><row><cell>sentence-t5-base</cell><cell>768</cell><cell>512</cell><cell>55.27</cell><cell>✓</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="38" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Is your llm outdated? benchmarking llms &amp; alignment algorithms for time-sensitive knowledge</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Mousavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Alghisi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Riccardi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.08700</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Magne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.07316</idno>
		<title level="m">Mteb: Massive text embedding benchmark</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Klabunde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schumacher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Strohmaier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lemmerich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.06329</idno>
		<title level="m">Similarity of neural network models: A survey of functional and representational measures</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Klabunde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">B</forename><surname>Amor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Granitzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lemmerich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.02730</idno>
		<title level="m">Towards measuring representational similarity of large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability</title>
		<author>
			<persName><forename type="first">M</forename><surname>Raghu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gilmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sohl-Dickstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Insights on representational similarity in neural networks with canonical correlation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Morcos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Raghu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Canonical correlation analysis: An overview with application to learning methods</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Hardoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Szedmak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shawe-Taylor</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neural computation</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="2639" to="2664" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Grounding representation similarity through statistical testing</title>
		<author>
			<persName><forename type="first">F</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-S</forename><surname>Denain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="1556" to="1568" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">On the similarity between hidden layers of pruned and unpruned convolutional neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zullich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pellegrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Medvet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ansuini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods</title>
				<meeting>the 9th International Conference on Pattern Recognition Applications and Methods</meeting>
		<imprint>
			<publisher>Scitepress</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="52" to="59" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Inner product-based neural network similarity</title>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Miao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Qiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Convergent learning: Do different neural networks learn the same representations?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lipson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hopcroft</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1511.07543</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Convergent learning: Do different neural networks learn the same representations?</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clune</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lipson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hopcroft</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v44/li15convergent.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Storcheus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rostamizadeh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</editor>
		<meeting>the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015<address><addrLine>PMLR, Montreal, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="196" to="212" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Similarity of neural network representations revisited</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kornblith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v97/kornblith19a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 36th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</editor>
		<meeting>the 36th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page" from="3519" to="3529" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Revisiting model stitching to compare neural representations</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakkiran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Barak</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2021/file/01ded4259d101feb739b06c399e9cd9c-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="225" to="236" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Understanding image representations by measuring their equivariance and equivalence</title>
		<author>
			<persName><forename type="first">K</forename><surname>Lenc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vedaldi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="991" to="999" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">On the functional similarity of robust and non-robust neural representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Balogh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jelasity</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/balogh23a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>the 40th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="1614" to="1635" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Launch and iterate: Reducing prediction churn</title>
		<author>
			<persName><forename type="first">M</forename><surname>Milani Fard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cormier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Canini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gupta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">DiffChaser: Detecting disagreements for deep neural networks</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19</title>
		<imprint>
			<publisher>International Joint Conferences on Artificial Intelligence Organization</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">ModelDiff: testing-based DNN similarity comparison for model reuse detection</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1145/3460319.3464816</idno>
		<ptr target="http://dx.doi.org/10.1145/3460319.3464816" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA &apos;21</title>
				<meeting>the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA &apos;21</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.01172</idno>
		<title level="m">Similarity analysis of contextual word representation models</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Word embeddings revisited: Do LLMs offer something new?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Freestone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K K</forename><surname>Santu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.11094</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Godfrey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Konz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kvinge</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.14993</idno>
		<title level="m">Understanding the inner workings of language models through representation dissimilarity</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rücklé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.08663</idno>
		<title level="m">BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Finardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Avila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Castaldoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gengo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Larcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Piau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Costa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Caridá</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.07883</idno>
		<title level="m">The chronicles of RAG: The retriever, the chunk and the generator</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Blended RAG: Improving RAG (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sawarkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mangal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Solanki</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.07220</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Es</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>James</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espinosa-Anke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schockaert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.15217</idno>
		<title level="m">RAGAS: Automated evaluation of retrieval augmented generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Measuring statistical dependence with Hilbert-Schmidt norms</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gretton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Smola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schölkopf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Algorithmic Learning Theory</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">U</forename><surname>Simon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Tomita</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="63" to="77" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Towards understanding the instability of network embedding</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Guan</surname></persName>
		</author>
		<idno type="DOI">10.1109/TKDE.2020.2989512</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="927" to="941" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><surname>Chroma Inc.</surname></persName>
		</author>
		<ptr target="https://docs.trychroma.com/" />
		<title level="m">Chroma Homepage</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.03533</idno>
		<title level="m">Text embeddings by weakly-supervised contrastive pre-training</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">H</forename><surname>Ábrego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.08877</idno>
		<title level="m">Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Large dual encoders are generalizable retrievers</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">H</forename><surname>Ábrego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.07899</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.07597</idno>
		<title level="m">C-Pack: Packaged resources to advance general Chinese embedding</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.03281</idno>
		<title level="m">Towards general text embeddings with multi-stage contrastive learning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://openai.com/blog/new-embedding-models-and-api-updates" />
		<title level="m">New embedding models with lower pricing, OpenAI Blog</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><surname>Cohere</surname></persName>
		</author>
		<ptr target="https://cohere.com/embeddings" />
		<title level="m">Embeddings - text embeddings with advanced language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>Cohere Homepage</note>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shakir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Koenig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lipp</surname></persName>
		</author>
		<ptr target="https://www.mixedbread.ai/blog/mxbai-embed-large-v1" />
		<title level="m">Open source strikes bread - new fluffy embeddings model</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.12871</idno>
		<title level="m">Angle-optimized text embeddings</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yavuz</surname></persName>
		</author>
		<ptr target="https://blog.salesforceairesearch.com/sfr-embedded-mistral/" />
		<title level="m">SFR-Embedding-Mistral: Enhance text retrieval with transfer learning, Salesforce AI Research Blog</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">seaborn: statistical data visualization</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Waskom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Open Source Software</title>
		<idno type="DOI">10.21105/joss.03021</idno>
		<ptr target="https://doi.org/10.21105/joss.03021" />
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
