<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BES4RAG: A Framework for Embedding Model Selection in Retrieval-Augmented Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Canale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Scotta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Messina</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Farinetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Politecnico di Torino</institution>
          ,
          <addr-line>Corso Duca degli Abruzzi 24, 10129, Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RAI - Centro Ricerche, Innovazione Tecnologica e Sperimentazione</institution>
          ,
          <addr-line>Via Giovanni Carlo Cavalli 6, 10138, Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Embedding model selection is a crucial step in optimizing Retrieval-Augmented Generation (RAG) systems. In this paper, we introduce BES4RAG, a framework designed to evaluate embedding models based on question-answering accuracy rather than standard retrieval metrics. BES4RAG automates dataset processing, automatic question generation, passage indexing, retrieval, and answer evaluation to determine the optimal embedding model for specific datasets. Experimental results on three diverse datasets confirm that embedding choice significantly afects performance, varies across datasets, and can enable smaller LLMs to outperform larger ones when paired with the right embeddings. Additionally, since a key component of this framework is automatic question generation, we found that its performance closely aligns with manually crafted questions, as evidenced by the Pearson correlation between the two.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Embedding Model Selection</kwd>
        <kwd>Automatic Question Generation</kwd>
        <kwd>Evaluation Framework</kwd>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>BES4RAG implements a fully automated pipeline that</title>
        <p>
          processes datasets, generates multiple-choice questions
Retrieval-Augmented Generation (RAG) has emerged (MCQs) using an LLM, indexes passages using diferent
as a powerful approach for improving the factual accu- embedding models, retrieves relevant documents, and
racy and contextual relevance of Large Language Models evaluates the accuracy of generated answers. By
com(LLMs) by incorporating external knowledge sources [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. paring retrieval-augmented responses across diferent
A crucial component of a RAG system is the embedding embeddings and LLM configurations, BES4RAG enables
model, which converts textual data into vector represen- practitioners to identify the best embedding model for
tations for retrieval [
          <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
          ]. Standard retrieval metrics their specific dataset and use case.
like Recall@k, Mean Reciprocal Rank (MRR), Normalized We used BES4RAG to conduct a series of experiments
Discounted Cumulative Gain (NDCG), Mean Average on three diverse types of datasets: news articles, TV
Precision (MAP), and Precision at some cutof (Preci- program transcripts, and movie-related data — including
sion@k) are commonly used to evaluate embeddings [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], both scripts and additional metadata — each with varying
but they do not always reflect how well retrieved pas- lengths and characteristics, addressing three key research
sages enhance answer quality. Additionally, these met- questions.
rics require knowing the source document of key answer
components, yet this information is not always easily RQ1 Are optimal embedding choices
datasetaccessible. dependent? We demonstrate that diferent
        </p>
        <p>In this work, we introduce BES4RAG, a framework datasets yield significantly diferent optimal
emdesigned to address these limitations by focusing on beddings, reinforcing the importance of
datasetevaluating embedding models based on their impact on specific selection.
question-answering accuracy, rather than relying solely
on traditional retrieval metrics.</p>
        <sec id="sec-1-1-1">
          <title>RQ2 Can small LLMs outperform larger models</title>
          <p>when paired with the right embeddings? Our
ifndings suggest that embedding quality can play
a more significant role than LLM size, highlighting
the necessity of embedding optimization.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>RQ3 Do results from automatically generated ques</title>
          <p>tions correlate with those from manually
created ones? We validate that automated
question evaluation is a reliable proxy for
humangenerated assessments, confirming the robustness
of BES4RAG’s methodology.</p>
          <p>
            In summary, our results emphasize the importance of In parallel, the automatic generation of questions using
evaluating embedding models based on their impact on LLMs has gained attention, especially in educational and
question-answering accuracy, with a methodology that evaluation contexts. In [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] it is presented a system that
minimizes user efort through the automatic generation allows users to specify a question type (e.g., reading,
of questions. speaking, or listening) and a base text, from which the
system automatically generates questions accordingly.
          </p>
          <p>
            A more structured approach with PFQS (Planning First,
2. Related Work Question Second) is proposed in [14], in which Llama 2
generates an answer plan that is then used to produce
The Massive Text Embedding Benchmark (MTEB) pro- relevant questions. While these methods demonstrate
vides a valuable overview of the performance of hun- the potential of LLMs for generating educational content,
dreds of embedding models across a variety of tasks and the systematic use of automatically generated questions
datasets [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. However, it also presents some limitations. for evaluating embedding performance in RAG systems
Even when models are evaluated on multiple datasets remains underexplored and merits further investigation.
for a given task, these datasets rarely match the specific
characteristics — such as language, document length, or
corpus size — of the data a user might use to build a RAG 3. BES4RAG: A Framework for
system. Additionally, for retrieval tasks, the evaluation
metrics adopted by MTEB may not be fully appropriate Selecting Embeddings in RAG.
in scenarios where the same information is spread across BES4RAG (Benchmarking Embeddings for Selection in
multiple documents. In such cases, the ranking of individ- RAG) is a modular framework written in Python code
ual documents becomes less meaningful, as the relevant and designed to assess embedding models end-to-end by
information is redundantly present in several of them. evaluating their performance in the full RAG pipeline,
          </p>
          <p>
            For these reasons, new evaluation methods are emerg- rather than relying solely on pre-retrieval metrics.
ing in the literature that incorporate Large Language Mod- BES4RAG difers from conventional evaluation
methels (LLMs) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. For example, in [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ], the capabilities of ods by integrating automated question generation and
ChatGPT and Llama2 are leveraged to evaluate embed- response evaluation within the RAG loop. This enables
ding models in the context of RAG. Instead of relying a direct comparison of how diferent embeddings afect
solely on retrieval metrics, ChatGPT is used to rank the the final output quality, making the framework suitable
relevance and usefulness of the context retrieved by dif- for real-world, task-specific deployment.
ferent embedding models. In [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], the authors propose a The framework, depicted in Figure 1, is publicly
availclustering-based approach to analyze the behavior of em- able on GitHub.1 In the following sections, we describe
bedding models within RAG systems. By grouping models the individual pipeline modules.
into families based on their retrieval characteristics, the
study reveals that top-k retrieval similarity can show high
variance across diferent model families, especially at 3.1. Data Preprocessing: File Conversion
lower values of . This highlights how seemingly similar and Organization
models may behave quite diferently in practice,
reinforcing the importance of dataset-specific and task-aware The preprocessing phase is handled by a module that
embedding evaluation. More recent work has further ingests a variety of input formats—namely JSON, TXT,
emphasized the importance of considering embedding and PDF files—and converts them into plain text for
performance specifically within RAG pipelines. Sakar downstream processing. This module also creates a
and Emekci, in [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], show that balancing context qual- file_mapping.json file, which records the
corresponity with similarity-based ranking is crucial, along with dence between the original input and the resulting text
understanding trade-ofs related to token usage, runtime, ifles. Optionally, a brief textual description can be
associand hardware constraints. Their findings highlight the ated with each input document. This description can be
role of contextual compression filters in improving hard- generated automatically based on the original filename or
ware eficiency and reducing token consumption, despite derived from the content using a large language model
their efect on similarity scores. Similarly, in [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] CO- (LLM); alternatively, the user can manually specify it.
COM is introduced, a context compression method that This step ensures that the dataset is normalized, forming
reduces long input contexts to a small set of compact the foundation for consistent question generation and
embeddings. This approach significantly accelerates gen- passage segmentation in later stages.
eration time by mitigating the overhead introduced by
lengthy contextual inputs, which directly impacts user
latency.
3.2. Automatic Questions Generation
3.4. Passages Indexing
A central component of BES4RAG is the automatic gener- The segmented passages are embedded using one or more
ation of MCQs from the input text. Using a LLM, the embedding models via the indexer module. This
modquestions_generator module selects random text ule computes and stores vector representations of the
segments from the normalized dataset and formulates passages.
          </p>
          <p>MCQs based on a customizable prompt template. The
standard prompt used for question generation is in Figure 3.5. Passages Retrieval
2. The questions are stored in JSON format.
3.3. Text Segmentation
Once the dataset is converted into text files, it is
segmented into passages suitable for indexing. The
passages_generator module performs this task by
applying a specified tokenizer to the input text. A key
consideration in this process is that the segmentation
into passages is determined by the embedding model
being used since the tokenizers have a maximum token
length. By default, the framework uses the maximum
token length supported. However, it is possible to specify
a smaller token length.</p>
          <p>Given a set of questions and indexed embeddings,
the passages_retriever module ranks the passages
based on similarity, typically using cosine similarity,
though other similarity metrics can be employed
depending on the embedding model. The retrieved passages are
then stored, organized by embedding model, allowing for
lfexible experimentation with diferent top-  retrieval
sizes.
3.6. Question Answering</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Using the retrieved passages and corresponding questions, the questions_answering module evaluates how well an LLM can answer each question in a RAG</title>
        <p>Create a multiple-choice question in the same language
as the text below, based solely on its content.
---------------------------&lt;&lt;&lt;text&gt;&gt;&gt;
---------------------------The question must be generic and must not contain
references to the article (e.g., "in the article..." or
"based on the text").</p>
        <p>If the text mentions a specific event, include full
details (e.g., name of war, date if available). Avoid
vague temporal references like "today."
Generate 4 answer options (1 correct, 3 plausible but
incorrect), each with an explanation of why it is
correct or not, based only on the text.</p>
        <p>Return your answer in this JSON format:
{
"question": "...",
"options": [</p>
        <p>{
}
setup. For each value of  (with default values of  =
0, 1, 2, 3, 4, 5, 10), the module combines the top-
retrieved passages with the question prompt and queries an
LLM to generate an answer. The prompt used for let the
LLM answer the questions is in Figure 3. The results are
stored in structured JSON files, organized by embedding
and LLM configuration.
The final module, q&amp;a_evaluator, assesses the
performance of the RAG system across diferent embeddings by
computing the answer accuracy over all questions. For
each embedding model and retrieval configuration (e.g.,
varying ), the module calculates accuracy and generates
a plot to visualize performance. This plot is crucial for
identifying the embedding model that leads to the best
overall performance in the specific domain or dataset
under analysis. Additionally, it helps determine the
optimal value of  for the considered task. This evaluation
also enables a comparison between free and open-source
embedding models and their proprietary counterparts,
providing insights into the trade-ofs between
computational cost and accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experimental Setup</title>
      <p>In this section, we describe the experimental setup used
to evaluate the performance of the proposed system. We
ifrst provide an overview of the datasets used, followed
by details about the embedding models and LLMs
employed in the pipeline. Finally, we explain the evaluation
metric adopted to measure the system’s performance in
answering questions.
4.1. Datasets
We evaluate our system on three distinct datasets, each
representing a diferent domain and content type. These
datasets were selected to test the system’s versatility and
ability to generalize across varying text types, from news
articles to transcripts of TV programs and movie scripts.
• RaiNews: This dataset consists of approximately
16,000 news articles, from the RaiNews portal,
covering a wide range of topics from current
events. The articles are typically short and serve
as concise textual documents, ideal for testing the
system’s ability to retrieve and generate answers
from concise content.
• Medicina33: This dataset includes roughly 159
full transcripts from the Medicina 33 TV program.
This Italian television program focuses on
medical topics, with discussions featuring experts in
the field of medicine. The transcripts are longer
with respect to the news, making them suitable
for testing the system’s handling of more
complex, specialized content.
• Movies: This dataset comprises approximately
2,000 movie scripts, metadata, and reviews. It
includes both short and long documents, providing
a diverse set of examples ranging from concise
summaries to lengthy dialogues. This dataset is
intended to evaluate the system’s performance
on text with a narrative structure and its ability to
handle various types of content, such as reviews
and scripts.
• dunzhanq/stella_en_1.5B_v56 [18]: A
largescale transformer model fine-tuned for English
sentence-level tasks, designed to provide
powerful embeddings for more complex textual data.</p>
      <sec id="sec-2-1">
        <title>Remark 1. We selected primarily multilingual embed</title>
        <p>The RaiNews and Medicina33 datasets are in Italian, ding models since our experiment involves two datasets
while the Movies dataset is in English. in Italian and one in English (see Section 4.1), to reduce
potential mismatches between dataset languages and
model training data. This choice ensures broader
lan4.2. Embedding Models guage coverage and more robust cross-lingual
represenIn our experiments, we distinguish between three main tations. However, BES4RAG does not aim to recommend
families of embedding models: ColBERT, OpenAI embed- a specific model a priori, but rather to evaluate a
userdings, and Sentence Transformers. defined set of models and identify the best-performing</p>
        <p>The ColBERT model, described in [15], is a state-of-the- one for the dataset considered.
art method for eficient and efective passage retrieval. To compare the embeddings produced by these models,
ColBERT uses a bi-level representation of text, allowing the most common similarity measure is cosine similarity,
for a more compact and computationally eficient rep- which computes the cosine of the angle between two
resentation of passages. The antoinelouis/colbert-xm2 vectors, capturing their relative orientation in the
emmodel, based on this framework, is a multilingual variant, bedding space. Cosine similarity is used for all models
providing advantages in multilingual tasks by capturing in our setup except for those in the ColBERT family. For
semantic meaning in multiple languages simultaneously. the latter, such as antoinelouis/colbert-xm, we instead</p>
        <p>Openai ofers a range of powerful models for generat- use the MaxSim function, a more specialized similarity
ing embeddings from text, including the text-embedding- measure designed for passage retrieval that works by
3-large3 model. The main disadvantage of these models ifrst computing the similarity between each individual
is that they are proprietary, and the vector representation query token and each document token using a similarity
is available only through a paid API. metric like cosine similarity; it then takes the maximum</p>
        <p>The Sentence Transformers family includes several mod- of these token-level similarities as the final relevance
els optimized for sentence-level embeddings. score between the query and the document.
• intfloat/multilingual-e5-large 4[16]: A multilin- Finally, for all datasets, the maximum token limits for
gual model capable of generating high-quality embeddings were applied to split the textual data into
embeddings for text in multiple languages. passages, except for the OpenAI model
text-embedding3-large (512 token limit), which is the same model as
• sentence-transformers/all-MiniLM-L6-v25 [17]: text-embedding-3-large but with maximum tokens length
A smaller, faster variant of the BERT model, pro- limited to 512. The decision of considering also this case
viding eficient sentence embeddings while main- was made based on the observation that increasing the
taining a high degree of accuracy for various NLP size of passages, although possible with this model, does
tasks. not necessarily improve the quality of the retrieved
information. This will become clear when observing the
results in Section 5.
2https://huggingface.co/antoinelouis/colbert-xm
3https://platform.openai.com/docs/models/
text-embedding-3-large
4https://huggingface.co/intfloat/multilingual-e5-large
5https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2</p>
      </sec>
      <sec id="sec-2-2">
        <title>6https://huggingface.co/dunzhanq/stella_en_1.5B_v5</title>
        <p>4.3. Large Language Models</p>
      </sec>
      <sec id="sec-2-3">
        <title>In our experimental setup, we employed two distinct</title>
        <p>families of LLMs for the generation of questions and
answering, respectively. For question generation, 5. Results and Discussion
GPT-4o7 model was adopted through the OpenAI API.</p>
        <p>For answering, we adopted two variants of the LLaMA RQ1: Optimal embedding choices vary
3.1 series developed by Meta: the 70-billion parameter across datasets
model meta-llama/Llama-3.1-70B-Instruct8
and the smaller 8-billion parameter version As observed in Figure 4, the accuracy of the Llama 3.1
meta-llama/Llama-3.1-8B-Instruct9. 70B model on automatically generated questions exhibits</p>
        <p>To ensure consistency and reduce stochastic variation variations not only with the number of retrieved
docuacross outputs, a temperature of 0 was used during infer- ments, but also with respect to the choice of embedding
ence for all models. Additionally, for answer generation model. The ranking of the embedding models varies
tasks, the maximum output length was restricted to a across datasets, as demonstrated by the diferent
persingle token, since the expected answer is always a dis- formance patterns observed in the first and subsequent
crete value in the set {0, 1, 2, 3}, in accordance with the positions. This variation highlights the dataset-specific
prompt specification described in Section 3.6. characteristics that influence the eficacy of embedding
models, further emphasizing the utility of the proposed
4.4. Evaluation Metric framework for selecting the optimal embeddings for
each dataset, rather than relying on a one-size-fits-all
approach.
often require detailed annotations that are not always
available.</p>
        <p>
          RQ2: Small LLMs can outperform bigger
LLMs with the right embedding
Unlike to what is done in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we do not aim to evaluate
the performance of our embedding models using a LLM
as an external judge. In other words, we do not rely on
the LLM to assess the quality of the retrieved passages
or to rate their relevance. Instead, we consider the end
goal of the pipeline: whether the final multiple-choice
answer produced by the RAG system is correct.
        </p>
        <p>To this end, we introduce a simple yet informative
metric that we refer to as Question Answering Accuracy
— or simply accuracy in the remainder of this paper. For
each question, the system selects an answer option based
on the response generated by the LLM, using the
passages retrieved by the embedding model. The accuracy
is computed as the proportion of questions for which the
selected answer matches the correct one, as defined in
the ground truth. This metric directly reflects the
efectiveness of the entire RAG pipeline in producing correct
answers, integrating both retrieval and generation
performance.</p>
      </sec>
      <sec id="sec-2-4">
        <title>In some cases, the choice of the embedding model may</title>
        <p>be even more critical than selecting the most powerful
LLM within a RAG system. This hypothesis is supported
by experimenting BES4RAG using two diferent LLMs
framework on the same dataset and with the same
embedding models. As shown in Figure 5, these
experiments demonstrate that using a more efective
embedding model with a smaller LLM can lead to better
performance than relying on a more powerful LLM
combined with weaker embedding models. In particular,
LLama 3.1 8B, when paired with antoinelouis/colbert-xm,
intfloat/multilingual-e5-large, or text-embedding-3-large,
outperforms the larger LLama 3.1 70B when the latter is
combined with sentence-transformers/all-MiniLM-L6-v2
Remark 2. Theoretically, the pipeline could be adapted to or dunzhanq/stella_en_1.5B_v5, at least for lower values
incorporate standard retrieval metrics such as those men- of . Indeed, for higher values of , the performance
tioned in Section 1, by changing the question generation of the smaller LLM deteriorates, likely due to the
inmodule so that questions are generated from individual creased prompt length exceeding its optimal processing
passages rather than from full documents. However, we capacity. These experiments highlight the importance of
adopt the Question Answering Accuracy metric for its di- carefully evaluating the choice of the embedding model,
rect alignment with the end goal of the RAG pipeline: especially when considering the use of smaller LLMs. In
selecting the embedding that enables correct answers. fact, selecting an efective embedding model can enable
While we acknowledge its binary nature and the lack the adoption of smaller language models, thus reducing
of granularity in capturing partial understanding or pas- computational requirements and leading to more
costsage quality, we consider this trade-of acceptable for an efective and resource-eficient solutions.
automated evaluation setup. More expressive metrics</p>
      </sec>
      <sec id="sec-2-5">
        <title>7https://openai.com/index/hello-gpt-4o/</title>
        <p>8https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
9https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
(a) RaiNews
(b) Medicina33
(c) Movies</p>
      </sec>
      <sec id="sec-2-6">
        <title>1,414 questions created by approximately eighty students</title>
        <p>enrolled in an undergraduate database course. These
students were instructed to formulate meaningful and
unambiguous multiple-choice questions based on the
movies scripts, plots and metadata.</p>
      </sec>
      <sec id="sec-2-7">
        <title>We then compared the accuracy scores obtained using Figure 5: Accuracy comparison between Llama 3.1 8B and these human-authored questions with the automatically</title>
        <p>Llama 3.1 70B on automatically generated questions from generated ones for the Movies dataset. Specifically, for
the Rainews dataset depending on the embedding models each embedding model and for each value of  in the
(the ones in Section 4.2, here with shortened names) and the top- retrieval, we computed the accuracy of the final
number of retrieved documents used to answer the questions answers returned by the RAG pipeline. This yielded two
(x-axis). matrices of scores: one for manual questions and one
for automatically generated questions, where rows
correspond to diferent embedding models and columns to
RQ3: Automatically generated and diferent  values.
user-generated questions We then calculated the Pearson correlation coeficient
between the corresponding entries of these two matrices
To assess whether evaluation using automatically gener- to quantify the alignment between the two evaluation
ated questions provides results consistent with human- modes. As shown in Table 2, the raw accuracy values
authored ones, we relied on a manually curated set of already exhibit a strong correlation ( = 0.78). When
applying min-max normalization per row (i.e., within each
embedding), the correlation improves slightly ( = 0.80),
indicating that the relative behavior of each model across
diferent  remains consistent. Finally, full matrix-wise
normalization further increases the correlation to  =
0.90, suggesting a strong structural similarity between
the two evaluation matrices. These findings support the
use of automatically generated questions as a viable proxy
for manual evaluation.</p>
        <p>Remark 3. In addition to the quantitative correlation
analysis, we manually inspected a random sample of both
human and automatically generated questions to assess
their coherence and correctness. The review confirmed
a high level of quality in both sets. The automatically
generated questions typically referred to more specific
and localized portions of the source text. Anyway, the
strong correlation observed between the two evaluation
modes further supports the use of automatically
generated questions as a reliable and eficient benchmark for
assessing embedding model performance.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and Future Work</title>
      <sec id="sec-3-1">
        <title>In this work, we presented BES4RAG, a modular frame</title>
        <p>work for the evaluation of embedding models in
retrievalaugmented generation (RAG) pipelines. The framework
provides a comprehensive approach by focusing on
endto-end evaluation, incorporating automatic question
generation, passage segmentation, and answer evaluation.</p>
        <p>Unlike traditional methods, which rely on pre-retrieval
metrics, BES4RAG integrates task-specific performance
assessments, allowing for a more accurate comparison
of embedding models based on their impact on the final
output.</p>
        <p>BES4RAG is also versatile, making it suitable for a
variety of use cases, including datasets that represent
subsets of larger corpora. A prime example would be
transcribed multimedia archives, where smaller portions
of the dataset can be used to efectively represent the
entire collection.</p>
        <p>Although BES4RAG demonstrates strong performance
and general applicability across diverse datasets, it is
not without limitations. One notable limit lies in its
reliance on automatically generated MCQs, which,
although eficient and scalable, may not always be adequate
in highly domain-specific contexts, i.e. in technical or
expert-driven fields where factual precision or nuanced
phrasing is critical. Furthermore, the binary nature of the
evaluation metric is easily interpretable, but it can fail to
capture partial understanding, near-miss responses, or
the contextual relevance of the retrieved passages. This
trade-of between simplicity and expressiveness, while
intentional for automation and reproducibility, highlights
the need for complementary metrics or qualitative
assessments in more complex scenarios.</p>
        <p>Looking ahead, avenues for future work include the
following:
• Investigating whether using two diferent LLMs
for question generation and retrieval provides
better performance or if using the same LLM for
both tasks yields comparable results.
• Exploring alternative methods for question
gener</p>
        <p>ation that consider larger portions of documents.
• Introducing new metrics to assess questions
without options, potentially linking detailed answers
back to one of the predefined options, ofering
more flexibility in evaluating the question-answer
generation process.
• Integrate within the pipeline some element that
returns statistical significance measures of the
results obtained, such as paired tests to assess
whether diferences between embedding models
are statistically significant. Moreover, regarding
the evaluation of LLM’s answers it could be
interesting to analyze the token-level probability
distribution to assess how embeddings afect the
confidence of LLM predictions.
• Study the scalability of the proposed approach
on significantly larger datasets, evaluating both
its performance and reliability under increased
data volume, as well as the computational time
and resource requirements of the entire pipeline.
Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrievalaugmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2020</year>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Egger</surname>
          </string-name>
          ,
          <source>Text Representations and Word Embeddings</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>361</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -88389-8_
          <fpage>16</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -88389-8_
          <fpage>16</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kim</surname>
          </string-name>
          , J. Springer, A. Raghunathan,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <article-title>Mitigating bias in rag: Controlling the embedder</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.17390. arXiv:
          <volume>2502</volume>
          .
          <fpage>17390</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence eration learning system for improve users' literembeddings using siamese bert-networks</article-title>
          , in: Pro- acy skills,
          <source>The Journal of the Korea institute of ceedings of the 2019 Conference on Empirical Meth- electronic communication sciences 19</source>
          (
          <year>2024</year>
          )
          <fpage>1243</fpage>
          - ods in
          <source>Natural Language Processing, Association 1248. for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https: [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Planning first, question sec//arxiv.org/abs/
          <year>1908</year>
          .10084.
          <article-title>ond: An LLM-guided method for controllable</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Koopman</surname>
          </string-name>
          ,
          <article-title>Semantic embedding question generation</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Marfor information retrieval</article-title>
          , in: 5th Workshop on tins, V. Srikumar (Eds.),
          <article-title>Findings of the AsBibliometric-Enhanced Information Retrieval, BIR sociation for Computational Linguistics: ACL 2017</article-title>
          , CEUR,
          <year>2017</year>
          , pp.
          <fpage>122</fpage>
          -
          <lpage>132</lpage>
          .
          <year>2024</year>
          , Association for Computational Linguistics,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          , Comparing the sensi- Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>4715</fpage>
          -
          <lpage>4729</lpage>
          . URL:
          <article-title>https: tivity of information retrieval metrics</article-title>
          , in: Pro- //aclanthology.org/
          <year>2024</year>
          .findings-acl.
          <volume>280</volume>
          /. doi: 10. ceedings of the 33rd
          <source>International ACM SIGIR 18653/v1/2024.findings-acl.280</source>
          . Conference on Research and Development in In- [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>Colbert: Eficient and efformation Retrieval, SIGIR '10, Association for fective passage search via contextualized late interComputing Machinery</article-title>
          , New York, NY, USA,
          <year>2010</year>
          , action over bert,
          <year>2020</year>
          . URL: https://arxiv.org/abs/ p.
          <fpage>667</fpage>
          -
          <lpage>674</lpage>
          . URL: https://doi.org/10.1145/1835449.
          <year>2004</year>
          .
          <volume>12832</volume>
          . arXiv:
          <year>2004</year>
          .
          <volume>12832</volume>
          . 1835560. doi:
          <volume>10</volume>
          .1145/1835449.1835560. [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Multilingual e5 text embeddings: A technical MTEB: Massive text embedding benchmark</article-title>
          ,
          <source>in: report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2402.05672. A.
          <string-name>
            <surname>Vlachos</surname>
          </string-name>
          , I. Augenstein (Eds.),
          <source>Proceedings arXiv:2402.05672. of the 17th Conference of the European Chap</source>
          <volume>-</volume>
          [17]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Zhou, ter of the Association for Computational Lin- Minilm: Deep self-attention distillation for taskguistics, Association for Computational Linguis- agnostic compression of pre-trained transformtics</article-title>
          , Dubrovnik, Croatia,
          <year>2023</year>
          , pp.
          <fpage>2014</fpage>
          -
          <lpage>2037</lpage>
          . ers,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>2002</year>
          .10957. URL: https://aclanthology.org/
          <year>2023</year>
          .eacl-main.
          <volume>148</volume>
          /. arXiv:
          <year>2002</year>
          .10957. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .eacl-main.
          <volume>148</volume>
          . [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          , Jasper
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Isbarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huseynova</surname>
          </string-name>
          ,
          <article-title>Enhanced document re- and stella: distillation of sota embedding modtrieval with topic embeddings</article-title>
          ,
          <source>in: 2024 IEEE 18th els</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.19048.
          <source>International Conference on Application of Infor- arXiv:2412.19048. mation and Communication Technologies (AICT)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:
          <volume>10</volume>
          .1109/AICT61888.
          <year>2024</year>
          .
          <volume>10740455</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kukreja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bharate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <article-title>Performance evaluation of vector embeddings with retrieval-augmented generation</article-title>
          ,
          <source>in: 2024 9th International Conference on Computer and Communication Systems (ICCCS)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>340</lpage>
          . doi:
          <volume>10</volume>
          .1109/ ICCCS61882.
          <year>2024</year>
          .
          <volume>10603291</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Caspari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. G.</given-names>
            <surname>Dastidar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zerhoudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mitrovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <article-title>Beyond benchmarks: Evaluating embedding model similarity for retrieval augmented generation systems</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/ abs/2407.08275. arXiv:
          <volume>2407</volume>
          .
          <fpage>08275</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Şakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Emekci</surname>
          </string-name>
          ,
          <article-title>Maximizing rag eficiency: A comparative analysis of rag methods</article-title>
          ,
          <source>Natural Language Processing</source>
          <volume>31</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          . doi:
          <volume>10</volume>
          .1017/ nlp.
          <year>2024</year>
          .
          <volume>53</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Déjean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Context embeddings for eficient answer generation in retrieval-augmented generation</article-title>
          ,
          <source>in: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>493</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Park</surname>
          </string-name>
          , S.-M. Park,
          <article-title>Llm-based question gen-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>