<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BES4RAG: A Framework for Embedding Model Selection in Retrieval-Augmented Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Canale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Scotta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Messina</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Farinetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Corso Duca degli Abruzzi 24, 10129, Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RAI - Centro Ricerche, Innovazione Tecnologica e Sperimentazione</institution>
          ,
          <addr-line>Via Giovanni Carlo Cavalli 6, 10138, Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Embedding model selection is a crucial step in optimizing Retrieval-Augmented Generation (RAG) systems. In this paper, we introduce BES4RAG, a framework designed to evaluate embedding models based on question-answering accuracy rather than standard retrieval metrics. BES4RAG automates dataset processing, question generation, passage indexing, retrieval, and answer evaluation to determine the optimal embedding model for specific datasets. Experimental results on three diverse datasets confirm that embedding choice significantly affects performance, varies across datasets, and can enable smaller LLMs to outperform larger ones when paired with the right embeddings. Additionally, since a key component of this framework is automatic question generation, we found that its performance closely aligns with manually crafted questions, as evidenced by the Pearson correlation between the two.</p>
      </abstract>
      <kwd-group>
        <kwd>Embedding Model Selection</kwd>
        <kwd>Automatic Question Generation</kwd>
        <kwd>Evaluation Framework</kwd>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>BES4RAG implements a fully automated pipeline that</title>
        <p>
          processes datasets, generates multiple-choice questions
Retrieval-Augmented Generation (RAG) has emerged (MCQs) using an LLM, indexes passages using diferent
as a powerful approach for improving the factual accu- embedding models, retrieves relevant documents, and
racy and contextual relevance of Large Language Models evaluates the accuracy of generated answers. By
com(LLMs) by incorporating external knowledge sources [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. paring retrieval-augmented responses across diferent
A crucial component of a RAG system is the embedding embeddings and LLM configurations, BES4RAG enables
model, which converts textual data into vector represen- practitioners to identify the best embedding model for
tations for retrieval [
          <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
          ]. Standard retrieval metrics their specific dataset and use case.
like Recall@k, Mean Reciprocal Rank (MRR), Normalized We used BES4RAG to conduct a series of experiments
Discounted Cumulative Gain (NDCG), Mean Average on three diverse types of datasets: news articles, TV
Precision (MAP), and Precision at some cutof (Preci- program transcripts, and movie-related data — including
sion@k) are commonly used to evaluate embeddings [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], both scripts and additional metadata — each with varying
but they do not always reflect how well retrieved pas- lengths and characteristics, addressing three key research
sages enhance answer quality. Additionally, these met- questions.
rics require knowing the source document of key answer
components, yet this information is not always easily RQ1 Are optimal embedding choices
datasetaccessible. dependent? We demonstrate that diferent
        </p>
        <p>In this work, we introduce BES4RAG, a framework datasets yield significantly diferent optimal
emdesigned to address these limitations by focusing on beddings, reinforcing the importance of
datasetevaluating embedding models based on their impact on specific selection.
question-answering accuracy, rather than relying solely
on traditional retrieval metrics.</p>
        <sec id="sec-1-1-1">
          <title>RQ2 Can small LLMs outperform larger models</title>
          <p>when paired with the right embeddings? Our
ifndings suggest that embedding quality can play
a more significant role than LLM size, highlighting
the necessity of embedding optimization.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>RQ3 Do results from automatically generated ques</title>
          <p>tions correlate with those from manually
created ones? We validate that automated
question evaluation is a reliable proxy for
humangenerated assessments, confirming the robustness
of BES4RAG’s methodology.</p>
      <p>In summary, our results emphasize the importance of evaluating embedding models based on their impact on question-answering accuracy, with a methodology that minimizes user effort through the automatic generation of questions.</p>
    </sec>
    <sec id="sec-related-work">
      <title>2. Related Work</title>
      <p>The Massive Text Embedding Benchmark (MTEB) provides a valuable overview of the performance of hundreds of embedding models across a variety of tasks and datasets [<xref ref-type="bibr" rid="ref7">7</xref>]. However, it also presents some limitations. Even when models are evaluated on multiple datasets for a given task, these datasets rarely match the specific characteristics (such as language, document length, or corpus size) of the data a user might use to build a RAG system. Additionally, for retrieval tasks, the evaluation metrics adopted by MTEB may not be fully appropriate in scenarios where the same information is spread across multiple documents. In such cases, the ranking of individual documents becomes less meaningful, as the relevant information is redundantly present in several of them.</p>
      <p>For these reasons, new evaluation methods are emerging in the literature that incorporate Large Language Models (LLMs) [<xref ref-type="bibr" rid="ref8">8</xref>]. For example, in [<xref ref-type="bibr" rid="ref9">9</xref>], the capabilities of ChatGPT and Llama 2 are leveraged to evaluate embedding models in the context of RAG. Instead of relying solely on retrieval metrics, ChatGPT is used to rank the relevance and usefulness of the context retrieved by different embedding models. In [<xref ref-type="bibr" rid="ref10">10</xref>], the authors propose a clustering-based approach to analyze the behavior of embedding models within RAG systems. By grouping models into families based on their retrieval characteristics, the study reveals that top-k retrieval similarity can show high variance across different model families, especially at lower values of k. This highlights how seemingly similar models may behave quite differently in practice, reinforcing the importance of dataset-specific and task-aware embedding evaluation. More recent work has further emphasized the importance of considering embedding performance specifically within RAG pipelines. Şakar and Emekci, in [<xref ref-type="bibr" rid="ref11">11</xref>], show that balancing context quality with similarity-based ranking is crucial, along with understanding trade-offs related to token usage, runtime, and hardware constraints. Their findings highlight the role of contextual compression filters in improving hardware efficiency and reducing token consumption, despite their effect on similarity scores. Similarly, in [<xref ref-type="bibr" rid="ref12">12</xref>] COCOM is introduced, a context compression method that reduces long input contexts to a small set of compact embeddings. This approach significantly accelerates generation time by mitigating the overhead introduced by lengthy contextual inputs, which directly impacts user latency.</p>
      <p>In parallel, the automatic generation of questions using LLMs has gained attention, especially in educational and evaluation contexts. In [<xref ref-type="bibr" rid="ref13">13</xref>], a system is presented that allows users to specify a question type (e.g., reading, speaking, or listening) and a base text, from which the system automatically generates questions accordingly. A more structured approach, PFQS (Planning First, Question Second), is proposed in [<xref ref-type="bibr" rid="ref14">14</xref>], in which Llama 2 generates an answer plan that is then used to produce relevant questions. While these methods demonstrate the potential of LLMs for generating educational content, the systematic use of automatically generated questions for evaluating embedding performance in RAG systems remains underexplored and merits further investigation.</p>
    </sec>
    <sec id="sec-framework">
      <title>3. BES4RAG: A Framework for Selecting Embeddings in RAG</title>
      <p>BES4RAG (Benchmarking Embeddings for Selection in RAG) is a modular framework written in Python and designed to assess embedding models end-to-end by evaluating their performance in the full RAG pipeline, rather than relying solely on pre-retrieval metrics. BES4RAG differs from conventional evaluation methods by integrating automated question generation and response evaluation within the RAG loop. This enables a direct comparison of how different embeddings affect the final output quality, making the framework suitable for real-world, task-specific deployment.</p>
      <p>The framework, depicted in Figure 1, is publicly available on GitHub.<sup>1</sup> In the following sections, we describe the individual pipeline modules.</p>
      <sec id="sec-framework-1">
        <title>3.1. Data Preprocessing: File Conversion and Organization</title>
        <p>The preprocessing phase is handled by a module that ingests a variety of input formats (namely JSON, TXT, and PDF files) and converts them into plain text for downstream processing. This module also creates a file_mapping.json file, which records the correspondence between the original input and the resulting text files. Optionally, a brief textual description can be associated with each input document. This description can be generated automatically based on the original filename or derived from the content using a large language model (LLM); alternatively, the user can manually specify it. This step ensures that the dataset is normalized, forming the foundation for consistent question generation and passage segmentation in later stages.</p>
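        <p>As an illustration, a minimal Python sketch of how such a preprocessing step could be organized is shown below. The function names, the pypdf dependency, and the file_mapping.json layout are our own assumptions for illustration, not the framework's actual code.</p>
        <preformat>
# Hypothetical sketch of the preprocessing step (not BES4RAG's actual code).
import json
from pathlib import Path

from pypdf import PdfReader  # assumed dependency for PDF-to-text conversion

def convert_to_text(path: Path) -> str:
    """Convert a JSON, TXT, or PDF input file into plain text."""
    if path.suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if path.suffix == ".json":
        # naive flattening; a real module would extract the relevant fields
        return json.dumps(json.loads(path.read_text(encoding="utf-8")), ensure_ascii=False)
    if path.suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    raise ValueError(f"unsupported format: {path.suffix}")

def preprocess(input_dir: str, output_dir: str) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    mapping = {}  # original file -> normalized text file
    for src in sorted(p for p in Path(input_dir).iterdir() if p.is_file()):
        dst = out / (src.stem + ".txt")
        dst.write_text(convert_to_text(src), encoding="utf-8")
        mapping[src.name] = dst.name
    (out / "file_mapping.json").write_text(json.dumps(mapping, indent=2))
</preformat>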
      </sec>
      <sec id="sec-framework-2">
        <title>3.2. Automatic Question Generation</title>
        <p>A central component of BES4RAG is the automatic generation of MCQs from the input text. Using an LLM, the questions_generator module selects random text segments from the normalized dataset and formulates MCQs based on a customizable prompt template. The standard prompt used for question generation is shown in Figure 2. The questions are stored in JSON format.</p>
        <p>Figure 2 (standard prompt used for question generation):</p>
        <preformat>
Create a multiple-choice question in the same language
as the text below, based solely on its content.
---------------------------
&lt;&lt;&lt;text&gt;&gt;&gt;
---------------------------
The question must be generic and must not contain
references to the article (e.g., "in the article..." or
"based on the text").
If the text mentions a specific event, include full
details (e.g., name of war, date if available). Avoid
vague temporal references like "today."
Generate 4 answer options (1 correct, 3 plausible but
incorrect), each with an explanation of why it is
correct or not, based only on the text.
Return your answer in this JSON format:
{
"question": "...",
"options": [
{
}
</preformat>
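        <p>A minimal sketch of how this module can be realized follows; the OpenAI client calls are used as documented, but the segment-sampling logic and the prompt handling are simplified assumptions rather than the module's actual implementation.</p>
        <preformat>
# Hypothetical sketch of MCQ generation (not BES4RAG's actual code).
import json
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_mcq(document_text: str, prompt_template: str, segment_chars: int = 2000) -> dict:
    """Sample a random text segment and ask the LLM for one MCQ in JSON form."""
    start = random.randrange(max(1, len(document_text) - segment_chars))
    segment = document_text[start:start + segment_chars]
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": prompt_template.replace("&lt;&lt;&lt;text&gt;&gt;&gt;", segment)}],
    )
    return json.loads(response.choices[0].message.content)
</preformat>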
      </sec>
      <sec id="sec-framework-3">
        <title>3.3. Text Segmentation</title>
        <p>Once the dataset is converted into text files, it is segmented into passages suitable for indexing. The passages_generator module performs this task by applying a specified tokenizer to the input text. A key consideration in this process is that the segmentation into passages is determined by the embedding model being used, since each tokenizer has a maximum token length. By default, the framework uses the maximum token length supported by the model; however, it is possible to specify a smaller token length.</p>
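        <p>The sketch below illustrates tokenizer-driven segmentation under these constraints; chunk_text is a hypothetical helper of our own, while the Hugging Face tokenizer API is used as documented.</p>
        <preformat>
# Hypothetical sketch of passage segmentation (not BES4RAG's actual code).
from transformers import AutoTokenizer

def chunk_text(text: str, model_name: str, max_tokens: int | None = None) -> list[str]:
    """Split text into passages no longer than the embedding model's token limit."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    limit = max_tokens or tokenizer.model_max_length  # default: model maximum
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + limit]) for i in range(0, len(ids), limit)]

passages = chunk_text(open("doc.txt").read(), "intfloat/multilingual-e5-large")
</preformat>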
      </sec>
      <sec id="sec-framework-4">
        <title>3.4. Passages Indexing</title>
        <p>The segmented passages are embedded using one or more embedding models via the indexer module. This module computes and stores vector representations of the passages.</p>
      </sec>
      <sec id="sec-framework-5">
        <title>3.5. Passages Retrieval</title>
        <p>Given a set of questions and indexed embeddings, the passages_retriever module ranks the passages based on similarity, typically using cosine similarity, though other similarity metrics can be employed depending on the embedding model. The retrieved passages are then stored, organized by embedding model, allowing for flexible experimentation with different top-k retrieval sizes.</p>
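        <p>Together, indexing and retrieval can be sketched as follows with sentence-transformers; this is an illustrative reimplementation under our assumptions, not the indexer and passages_retriever code itself.</p>
        <preformat>
# Hypothetical sketch of indexing and top-k retrieval (not BES4RAG's actual code).
import numpy as np
from sentence_transformers import SentenceTransformer

passages = ["first passage ...", "second passage ..."]  # output of the segmentation step
model = SentenceTransformer("intfloat/multilingual-e5-large")

# Indexing: embed and store passage vectors, L2-normalized so that a dot
# product between two vectors equals their cosine similarity.
index = model.encode(passages, normalize_embeddings=True)

def retrieve(question: str, k: int = 5) -> list[str]:
    """Rank passages by cosine similarity to the question and return the top k."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = index @ q
    return [passages[i] for i in np.argsort(-scores)[:k]]
</preformat>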
      </sec>
      <sec id="sec-framework-6">
        <title>3.6. Question Answering</title>
        <p>Using the retrieved passages and the corresponding questions, the questions_answering module evaluates how well an LLM can answer each question in a RAG setup. For each value of k (with default values of k = 0, 1, 2, 3, 4, 5, 10), the module combines the top-k retrieved passages with the question prompt and queries an LLM to generate an answer. The prompt used to let the LLM answer the questions is shown in Figure 3. The results are stored in structured JSON files, organized by embedding and LLM configuration.</p>
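        <p>A simplified sketch of the prompt-assembly step follows; the wording here stands in for the actual template in Figure 3, which we do not reproduce verbatim, and the option field names are assumed for illustration.</p>
        <preformat>
# Hypothetical sketch of RAG question answering (not BES4RAG's actual code).
def build_answer_prompt(question: dict, retrieved: list[str]) -> str:
    """Combine the top-k retrieved passages with the MCQ; the LLM must reply 0-3."""
    context = "\n\n".join(retrieved)  # empty string when k = 0 (no-retrieval baseline)
    # option field name "text" is assumed for illustration
    options = "\n".join(f"{i}: {opt['text']}" for i, opt in enumerate(question["options"]))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question['question']}\n{options}\n"
        "Answer with a single digit: 0, 1, 2 or 3."
    )
</preformat>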
        <p>The final module, q&amp;a_evaluator, assesses the performance of the RAG system across different embeddings by computing the answer accuracy over all questions. For each embedding model and retrieval configuration (e.g., varying k), the module calculates accuracy and generates a plot to visualize performance. This plot is crucial for identifying the embedding model that leads to the best overall performance in the specific domain or dataset under analysis. Additionally, it helps determine the optimal value of k for the considered task. This evaluation also enables a comparison between free and open-source embedding models and their proprietary counterparts, providing insights into the trade-offs between computational cost and accuracy.</p>
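        <p>Conceptually, the evaluator reduces to the following computation; the result layout assumed here (a mapping from embedding model and k to predicted and gold answers) is illustrative, not the module's actual data structure.</p>
        <preformat>
# Hypothetical sketch of the accuracy computation and plot (not BES4RAG's actual code).
import matplotlib.pyplot as plt

def accuracy(predicted: list[int], gold: list[int]) -> float:
    """Fraction of questions whose selected option matches the correct one."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def plot_results(results: dict, ks=(0, 1, 2, 3, 4, 5, 10)) -> None:
    """results[embedding][k] = (predicted answers, gold answers)."""
    for embedding, per_k in results.items():
        plt.plot(ks, [accuracy(*per_k[k]) for k in ks], marker="o", label=embedding)
    plt.xlabel("k (number of retrieved passages)")
    plt.ylabel("question answering accuracy")
    plt.legend()
    plt.show()
</preformat>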
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Setup</title>
      <p>In this section, we describe the experimental setup used to evaluate the performance of the proposed system. We first provide an overview of the datasets used, followed by details about the embedding models and LLMs employed in the pipeline. Finally, we explain the evaluation metric adopted to measure the system's performance in answering questions.</p>
      <sec id="sec-2-1">
        <title>4.1. Datasets</title>
        <p>We evaluate our system on three distinct datasets, each representing a different domain and content type. These datasets were selected to test the system's versatility and ability to generalize across varying text types, from news articles to transcripts of TV programs and movie scripts.</p>
        <p>• RaiNews: This dataset consists of approximately 16,000 news articles from the RaiNews portal, covering a wide range of topics from current events. The articles are typically short and serve as concise textual documents, ideal for testing the system's ability to retrieve and generate answers from concise content.</p>
        <p>• Medicina33: This dataset includes roughly 159 full transcripts from the Medicina 33 TV program. This Italian television program focuses on medical topics, with discussions featuring experts in the field of medicine. The transcripts are longer than the news articles, making them suitable for testing the system's handling of more complex, specialized content.</p>
        <p>• Movies: This dataset comprises approximately 2,000 movie scripts, metadata, and reviews. It includes both short and long documents, providing a diverse set of examples ranging from concise summaries to lengthy dialogues. This dataset is intended to evaluate the system's performance on text with a narrative structure and its ability to handle various types of content, such as reviews and scripts.</p>
        <p>The RaiNews and Medicina33 datasets are in Italian, while the Movies dataset is in English.</p>
      </sec>
      <sec id="sec-2-1">
        <title>Remark 1. We selected primarily multilingual embed</title>
        <p>The RaiNews and Medicina33 datasets are in Italian, ding models since our experiment involves two datasets
while the Movies dataset is in English. in Italian and one in English (see Section 4.1), to reduce
potential mismatches between dataset languages and
model training data. This choice ensures broader
lan4.2. Embedding Models guage coverage and more robust cross-lingual
represenIn our experiments, we distinguish between three main tations. However, BES4RAG does not aim to recommend
families of embedding models: ColBERT, OpenAI embed- a specific model a priori, but rather to evaluate a
userdings, and Sentence Transformers. defined set of models and identify the best-performing</p>
        <p>The ColBERT model, described in [15], is a state-of-the- one for the dataset considered.
art method for eficient and efective passage retrieval. To compare the embeddings produced by these models,
ColBERT uses a bi-level representation of text, allowing the most common similarity measure is cosine similarity,
for a more compact and computationally eficient rep- which computes the cosine of the angle between two
resentation of passages. The antoinelouis/colbert-xm2 vectors, capturing their relative orientation in the
emmodel, based on this framework, is a multilingual variant, bedding space. Cosine similarity is used for all models
providing advantages in multilingual tasks by capturing in our setup except for those in the ColBERT family. For
semantic meaning in multiple languages simultaneously. the latter, such as antoinelouis/colbert-xm, we instead</p>
        <p>Openai ofers a range of powerful models for generat- use the MaxSim function, a more specialized similarity
ing embeddings from text, including the text-embedding- measure designed for passage retrieval that works by
3-large3 model. The main disadvantage of these models ifrst computing the similarity between each individual
is that they are proprietary, and the vector representation query token and each document token using a similarity
is available only through a paid API. metric like cosine similarity; it then takes the maximum</p>
        <p>The Sentence Transformers family includes several mod- of these token-level similarities as the final relevance
els optimized for sentence-level embeddings. score between the query and the document.
• intfloat/multilingual-e5-large 4[16]: A multilin- Finally, for all datasets, the maximum token limits for
gual model capable of generating high-quality embeddings were applied to split the textual data into
embeddings for text in multiple languages. passages, except for the OpenAI model
text-embedding3-large (512 token limit), which is the same model as
• sentence-transformers/all-MiniLM-L6-v25 [17]: text-embedding-3-large but with maximum tokens length
A smaller, faster variant of the BERT model, pro- limited to 512. The decision of considering also this case
viding eficient sentence embeddings while main- was made based on the observation that increasing the
taining a high degree of accuracy for various NLP size of passages, although possible with this model, does
tasks. not necessarily improve the quality of the retrieved
information. This will become clear when observing the
results in Section 5.
2https://huggingface.co/antoinelouis/colbert-xm
3https://platform.openai.com/docs/models/
text-embedding-3-large
4https://huggingface.co/intfloat/multilingual-e5-large
5https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2</p>
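        <p>Under these definitions, the MaxSim relevance score can be written compactly as follows; this is a didactic NumPy sketch of ColBERT-style scoring, not the library implementation.</p>
        <preformat>
# Didactic sketch of ColBERT-style MaxSim scoring (not the library code).
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """MaxSim: for each L2-normalized query token embedding (one per row), take
    the maximum cosine similarity over all document token embeddings, then sum."""
    sim = query_tokens @ doc_tokens.T    # token-by-token cosine similarities
    return float(sim.max(axis=1).sum())  # best document token per query token
</preformat>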
        <p>Finally, for all datasets, the maximum token limit of each embedding model was used to split the textual data into passages, with one exception: we also evaluated a variant, denoted text-embedding-3-large (512 token limit), which is the same OpenAI model as text-embedding-3-large but with the maximum passage length capped at 512 tokens. The decision to also consider this case was based on the observation that increasing the size of passages, although possible with this model, does not necessarily improve the quality of the retrieved information. This will become clear when observing the results in Section 5.</p>
        <p>2: https://huggingface.co/antoinelouis/colbert-xm
3: https://platform.openai.com/docs/models/text-embedding-3-large
4: https://huggingface.co/intfloat/multilingual-e5-large
5: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
6: https://huggingface.co/dunzhanq/stella_en_1.5B_v5</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.3. Large Language Models</title>
      <sec id="sec-2-3">
        <title>In our experimental setup, we employed two distinct</title>
        <p>families of LLMs for the generation of questions and
answering, respectively. For question generation, 5. Results and Discussion
GPT-4o7 model was adopted through the OpenAI API.</p>
        <p>For answering, we adopted two variants of the LLaMA RQ1: Optimal embedding choices vary
3.1 series developed by Meta: the 70-billion parameter across datasets
model meta-llama/Llama-3.1-70B-Instruct8
and the smaller 8-billion parameter version As observed in Figure 4, the accuracy of the Llama 3.1
meta-llama/Llama-3.1-8B-Instruct9. 70B model on automatically generated questions exhibits</p>
        <p>To ensure consistency and reduce stochastic variation variations not only with the number of retrieved
docuacross outputs, a temperature of 0 was used during infer- ments, but also with respect to the choice of embedding
ence for all models. Additionally, for answer generation model. The ranking of the embedding models varies
tasks, the maximum output length was restricted to a across datasets, as demonstrated by the diferent
persingle token, since the expected answer is always a dis- formance patterns observed in the first and subsequent
crete value in the set {0, 1, 2, 3}, in accordance with the positions. This variation highlights the dataset-specific
prompt specification described in Section 3.6. characteristics that influence the eficacy of embedding
models, further emphasizing the utility of the proposed
4.4. Evaluation Metric framework for selecting the optimal embeddings for
each dataset, rather than relying on a one-size-fits-all
approach.
often require detailed annotations that are not always
available.</p>
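        <p>This single-token constraint can be enforced directly at decoding time; the sketch below shows one plausible way to do so with Hugging Face transformers, and is an assumption on our part rather than the exact inference setup used in the experiments.</p>
        <preformat>
# Hypothetical sketch of deterministic single-token answering (not the exact setup used).
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def answer(prompt: str) -> str:
    out = generator(
        prompt,
        max_new_tokens=1,        # the answer is a single digit in {0, 1, 2, 3}
        do_sample=False,         # greedy decoding, the analogue of temperature 0
        return_full_text=False,  # return only the generated token
    )
    return out[0]["generated_text"].strip()
</preformat>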
        <p>7: https://openai.com/index/hello-gpt-4o/
8: https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
9: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Evaluation Metric</title>
        <p>Unlike what is done in [<xref ref-type="bibr" rid="ref9">9</xref>], we do not aim to evaluate the performance of our embedding models using an LLM as an external judge. In other words, we do not rely on the LLM to assess the quality of the retrieved passages or to rate their relevance. Instead, we consider the end goal of the pipeline: whether the final multiple-choice answer produced by the RAG system is correct.</p>
        <p>To this end, we introduce a simple yet informative metric that we refer to as Question Answering Accuracy, or simply accuracy in the remainder of this paper. For each question, the system selects an answer option based on the response generated by the LLM, using the passages retrieved by the embedding model. The accuracy is computed as the proportion of questions for which the selected answer matches the correct one, as defined in the ground truth. This metric directly reflects the effectiveness of the entire RAG pipeline in producing correct answers, integrating both retrieval and generation performance.</p>
        <p>Remark 2. Theoretically, the pipeline could be adapted to incorporate standard retrieval metrics such as those mentioned in Section 1, by changing the question generation module so that questions are generated from individual passages rather than from full documents. However, we adopt the Question Answering Accuracy metric for its direct alignment with the end goal of the RAG pipeline: selecting the embedding that enables correct answers. While we acknowledge its binary nature and the lack of granularity in capturing partial understanding or passage quality, we consider this trade-off acceptable for an automated evaluation setup. More expressive metrics often require detailed annotations that are not always available.</p>
      </sec>
    </sec>
      <sec id="sec-2-4">
        <title>In some cases, the choice of the embedding model may</title>
        <p>be even more critical than selecting the most powerful
LLM within a RAG system. This hypothesis is supported
by experimenting BES4RAG using two diferent LLMs
framework on the same dataset and with the same
embedding models. As shown in Figure 5, these
experiments demonstrate that using a more efective
embedding model with a smaller LLM can lead to better
performance than relying on a more powerful LLM
combined with weaker embedding models. In particular,
LLama 3.1 8B, when paired with antoinelouis/colbert-xm,
intfloat/multilingual-e5-large, or text-embedding-3-large,
outperforms the larger LLama 3.1 70B when the latter is
combined with sentence-transformers/all-MiniLM-L6-v2
Remark 2. Theoretically, the pipeline could be adapted to or dunzhanq/stella_en_1.5B_v5, at least for lower values
incorporate standard retrieval metrics such as those men- of . Indeed, for higher values of , the performance
tioned in Section 1, by changing the question generation of the smaller LLM deteriorates, likely due to the
inmodule so that questions are generated from individual creased prompt length exceeding its optimal processing
passages rather than from full documents. However, we capacity. These experiments highlight the importance of
adopt the Question Answering Accuracy metric for its di- carefully evaluating the choice of the embedding model,
rect alignment with the end goal of the RAG pipeline: especially when considering the use of smaller LLMs. In
selecting the embedding that enables correct answers. fact, selecting an efective embedding model can enable
While we acknowledge its binary nature and the lack the adoption of smaller language models, thus reducing
of granularity in capturing partial understanding or pas- computational requirements and leading to more
costsage quality, we consider this trade-of acceptable for an efective and resource-eficient solutions.
automated evaluation setup. More expressive metrics</p>
      </sec>
      <sec id="sec-2-5">
        <title>7https://openai.com/index/hello-gpt-4o/</title>
        <p>8https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
9https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
(a) RaiNews
(b) Medicina33
(c) Movies</p>
      </sec>
      <sec id="sec-2-6">
        <title>1,414 questions created by approximately eighty students</title>
        <p>enrolled in an undergraduate database course. These
students were instructed to formulate meaningful and
unambiguous multiple-choice questions based on the
movies scripts, plots and metadata.</p>
      </sec>
      <sec id="sec-2-7">
        <title>We then compared the accuracy scores obtained using Figure 5: Accuracy comparison between Llama 3.1 8B and these human-authored questions with the automatically</title>
        <p>Llama 3.1 70B on automatically generated questions from generated ones for the Movies dataset. Specifically, for
the Rainews dataset depending on the embedding models each embedding model and for each value of  in the
(the ones in Section 4.2, here with shortened names) and the top- retrieval, we computed the accuracy of the final
number of retrieved documents used to answer the questions answers returned by the RAG pipeline. This yielded two
(x-axis). matrices of scores: one for manual questions and one
for automatically generated questions, where rows
correspond to diferent embedding models and columns to
RQ3: Automatically generated and diferent  values.
user-generated questions We then calculated the Pearson correlation coeficient
between the corresponding entries of these two matrices
To assess whether evaluation using automatically gener- to quantify the alignment between the two evaluation
ated questions provides results consistent with human- modes. As shown in Table 2, the raw accuracy values
authored ones, we relied on a manually curated set of already exhibit a strong correlation ( = 0.78). When
applying min-max normalization per row (i.e., within each
embedding), the correlation improves slightly ( = 0.80),
indicating that the relative behavior of each model across
diferent  remains consistent. Finally, full matrix-wise
normalization further increases the correlation to  =
0.90, suggesting a strong structural similarity between
the two evaluation matrices. These findings support the
use of automatically generated questions as a viable proxy
for manual evaluation.</p>
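        <p>The following sketch reproduces this comparison procedure under our assumptions about the matrix layout (rows: embedding models; columns: k values); scipy.stats.pearsonr is used as documented, but this is not the exact analysis script.</p>
        <preformat>
# Hypothetical sketch of the correlation analysis (not the exact analysis script).
import numpy as np
from scipy.stats import pearsonr

def minmax(a: np.ndarray) -> np.ndarray:
    return (a - a.min()) / (a.max() - a.min())

def compare(manual: np.ndarray, auto: np.ndarray) -> None:
    """manual and auto: accuracy matrices, rows = embeddings, columns = k values."""
    print("raw:        ", pearsonr(manual.ravel(), auto.ravel())[0])
    per_row_m = np.vstack([minmax(row) for row in manual])
    per_row_a = np.vstack([minmax(row) for row in auto])
    print("per-row:    ", pearsonr(per_row_m.ravel(), per_row_a.ravel())[0])
    print("matrix-wise:", pearsonr(minmax(manual).ravel(), minmax(auto).ravel())[0])
</preformat>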
        <p>Remark 3. In addition to the quantitative correlation analysis, we manually inspected a random sample of both human-authored and automatically generated questions to assess their coherence and correctness. The review confirmed a high level of quality in both sets; the automatically generated questions typically referred to more specific and localized portions of the source text. In any case, the strong correlation observed between the two evaluation modes further supports the use of automatically generated questions as a reliable and efficient benchmark for assessing embedding model performance.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and Future Work</title>
      <sec id="sec-3-1">
        <title>In this work, we presented BES4RAG, a modular frame</title>
        <p>work for the evaluation of embedding models in
retrievalaugmented generation (RAG) pipelines. The framework
provides a comprehensive approach by focusing on
endto-end evaluation, incorporating automatic question
generation, passage segmentation, and answer evaluation.</p>
        <p>Unlike traditional methods, which rely on pre-retrieval
metrics, BES4RAG integrates task-specific performance
assessments, allowing for a more accurate comparison
of embedding models based on their impact on the final
output.</p>
        <p>BES4RAG is also versatile, making it suitable for a
variety of use cases, including datasets that represent
subsets of larger corpora. A prime example would be
transcribed multimedia archives, where smaller portions
of the dataset can be used to effectively represent the
entire collection.</p>
      <p>Although BES4RAG demonstrates strong performance and general applicability across diverse datasets, it is not without limitations. One notable limit lies in its reliance on automatically generated MCQs, which, although efficient and scalable, may not always be adequate in highly domain-specific contexts, i.e., in technical or expert-driven fields where factual precision or nuanced phrasing is critical. Furthermore, the binary nature of the evaluation metric is easily interpretable, but it can fail to capture partial understanding, near-miss responses, or the contextual relevance of the retrieved passages. This trade-off between simplicity and expressiveness, while intentional for automation and reproducibility, highlights the need for complementary metrics or qualitative assessments in more complex scenarios.</p>
      <p>Looking ahead, avenues for future work include the following:
• Investigating whether using two different LLMs for question generation and retrieval provides better performance, or whether using the same LLM for both tasks yields comparable results.
• Exploring alternative methods for question generation that consider larger portions of documents.
• Introducing new metrics to assess questions without options, potentially linking detailed answers back to one of the predefined options, offering more flexibility in evaluating the question-answer generation process.
• Integrating within the pipeline components that return statistical significance measures of the results obtained, such as paired tests to assess whether differences between embedding models are statistically significant. Moreover, regarding the evaluation of the LLM's answers, it could be interesting to analyze the token-level probability distribution to assess how embeddings affect the confidence of LLM predictions.
• Studying the scalability of the proposed approach on significantly larger datasets, evaluating both its performance and reliability under increased data volume, as well as the computational time and resource requirements of the entire pipeline.</p>
    </sec>
    <sec id="sec-4">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrievalaugmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2020</year>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Egger</surname>
          </string-name>
          ,
          <source>Text Representations and Word Embeddings</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>361</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -88389-8_
          <fpage>16</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -88389-8_
          <fpage>16</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          , J. Springer, A. Raghunathan,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <article-title>Mitigating bias in rag: Controlling the embedder</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.17390. arXiv:
          <volume>2502</volume>
          .
          <fpage>17390</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence eration learning system for improve users' literembeddings using siamese bert-networks</article-title>
          , in: Pro- acy skills,
          <source>The Journal of the Korea institute of ceedings of the 2019 Conference on Empirical Meth- electronic communication sciences 19</source>
          (
          <year>2024</year>
          )
          <fpage>1243</fpage>
          - ods in
          <source>Natural Language Processing, Association 1248. for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https: [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Planning first, question sec//arxiv.org/abs/
          <year>1908</year>
          .10084.
          <article-title>ond: An LLM-guided method for controllable</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <article-title>Semantic embedding question generation</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Marfor information retrieval</article-title>
          , in: 5th Workshop on tins, V. Srikumar (Eds.),
          <article-title>Findings of the AsBibliometric-Enhanced Information Retrieval, BIR sociation for Computational Linguistics: ACL 2017</article-title>
          , CEUR,
          <year>2017</year>
          , pp.
          <fpage>122</fpage>
          -
          <lpage>132</lpage>
          .
          <year>2024</year>
          , Association for Computational Linguistics,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          , Comparing the sensi- Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>4715</fpage>
          -
          <lpage>4729</lpage>
          . URL:
          <article-title>https: tivity of information retrieval metrics</article-title>
          , in: Pro- //aclanthology.org/
          <year>2024</year>
          .findings-acl.
          <volume>280</volume>
          /. doi: 10. ceedings of the 33rd
          <source>International ACM SIGIR 18653/v1/2024.findings-acl.280</source>
          . Conference on Research and Development in In- [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Colbert: Eficient and efformation Retrieval, SIGIR '10, Association for fective passage search via contextualized late interComputing Machinery</article-title>
          , New York, NY, USA,
          <year>2010</year>
          , action over bert,
          <year>2020</year>
          . URL: https://arxiv.org/abs/ p.
          <fpage>667</fpage>
          -
          <lpage>674</lpage>
          . URL: https://doi.org/10.1145/1835449.
          <year>2004</year>
          .
          <volume>12832</volume>
          . arXiv:
          <year>2004</year>
          .
          <volume>12832</volume>
          . 1835560. doi:
          <volume>10</volume>
          .1145/1835449.1835560. [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Multilingual e5 text embeddings: A technical MTEB: Massive text embedding benchmark</article-title>
          ,
          <source>in: report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.05672. A.
          <string-name>
            <surname>Vlachos</surname>
          </string-name>
          , I. Augenstein (Eds.),
          <source>Proceedings arXiv:2402.05672. of the 17th Conference of the European Chap</source>
          <volume>-</volume>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Zhou, ter of the Association for Computational Lin- Minilm: Deep self-attention distillation for taskguistics, Association for Computational Linguis- agnostic compression of pre-trained transformtics</article-title>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>2014</fpage>
          -
          <lpage>2037</lpage>
          . ers,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>2002</year>
          .10957. URL: https://aclanthology.org/
          <year>2023</year>
          .eacl-main.
          <volume>148</volume>
          /. arXiv:
          <year>2002</year>
          .10957. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .eacl-main.
          <volume>148</volume>
          . [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Jasper
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Isbarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huseynova</surname>
          </string-name>
          ,
          <article-title>Enhanced document re- and stella: distillation of sota embedding modtrieval with topic embeddings</article-title>
          ,
          <source>in: 2024 IEEE 18th els</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.19048.
          <source>International Conference on Application of Infor- arXiv:2412.19048. mation and Communication Technologies (AICT)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <volume>10</volume>
          .1109/AICT61888.
          <year>2024</year>
          .
          <volume>10740455</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kukreja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bharate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <article-title>Performance evaluation of vector embeddings with retrieval-augmented generation</article-title>
          ,
          <source>in: 2024 9th International Conference on Computer and Communication Systems (ICCCS)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>340</lpage>
          . doi:
          <volume>10</volume>
          .1109/ ICCCS61882.
          <year>2024</year>
          .
          <volume>10603291</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Caspari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Dastidar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zerhoudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitrovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <article-title>Beyond benchmarks: Evaluating embedding model similarity for retrieval augmented generation systems</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/ abs/2407.08275. arXiv:
          <volume>2407</volume>
          .
          <fpage>08275</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Şakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Emekci</surname>
          </string-name>
          ,
          <article-title>Maximizing rag eficiency: A comparative analysis of rag methods</article-title>
          ,
          <source>Natural Language Processing</source>
          <volume>31</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          . doi:
          <volume>10</volume>
          .1017/ nlp.
          <year>2024</year>
          .
          <volume>53</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Déjean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Context embeddings for eficient answer generation in retrieval-augmented generation</article-title>
          ,
          <source>in: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>493</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Park</surname>
          </string-name>
          , S.-M. Park,
          <article-title>Llm-based question gen-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>