<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Diffusion-Aided RAG: Elevating Dense-Retrieval Chatbots via Graph-Based Diffusion Reranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sai Teja Dampanaboina</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Nishchal Gamini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karishma Kunwar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Polignano</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Levantesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto William De Luca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>George Eckert Institute</institution>
          ,
          <addr-line>Brunswick</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Institute for Educational Media</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Otto-von-Guericke University</institution>
          ,
          <addr-line>Universitätspl. 2, 39106 Magdeburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Bari Aldo Moro</institution>
          ,
          <addr-line>via E. Orabona 4, 70125, Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a comprehensive framework for enhancing dense-retrieval-based chatbots through the integration of graph-based diffusion reranking. Addressing challenges in traditional retrieval-augmented generation (RAG) systems, the proposed methodology incorporates a multi-step pipeline that advances document retrieval and relevance ranking. Initially, candidate passages are retrieved via dense embeddings, followed by the construction of a graph representation that captures inter-passage semantic relationships. Through a graph-based diffusion process, the reranking mechanism refines the selection, amplifying clusters of contextually relevant documents while mitigating noise effects from irrelevant data points. Experimental results demonstrate significant gains in retrieval quality and question-answering accuracy, underscoring the framework's potential for knowledge-intensive real-time applications such as conversational AI. This work reflects a pivotal step towards developing highly accurate, dynamic, and scalable multimodal conversational systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Chatbots</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>PageRank</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Advanced chatbots and other modern NLP tools need fast access to up-to-date, specific information. Although Large Language Models (LLMs) can generate fluent responses and handle a wide range of topics, they are stuck with whatever they learned during training, and their knowledge can become outdated or be too general [1]. RAG [2] solves this by enabling the LLM to retrieve information from an external database that can be updated in real time. This strategic decoupling of the LLM's generative function from data management, including storage, indexing, and, crucially, retrieval, allows for continuous knowledge updates, thereby enhancing the responsiveness, reliability, and domain fidelity of such systems. Being able to quickly and accurately find the right information from a variety of sources is essential for powering these next-generation NLP systems.</p>
      <p>Information Retrieval (IR) has evolved considerably to help us find relevant information more quickly and accurately in huge collections of text [3]. Instead of just matching words on the page, many modern systems use dense representations: numeric embeddings that capture the meaning of queries and documents. This method, called dense passage retrieval [4], makes it possible to find passages that are related in meaning even if they do not share the same exact words. Still, pulling back the single best set of passages from an enormous database is tough, and the first batch of results often requires additional refinement to make sure it is really on point. That is why it is common to run additional steps like re-ranking to fine-tune and improve the final selection.</p>
      <p>We aim to make dense passage retrieval work even better by using a multi-step pipeline. First, we pull an initial batch of candidate passages with a dense retriever. Then we turn those top documents into a graph and run a diffusion process over it. This lets us capture how the passages relate to each other. By using this graph-based diffusion as a re-ranker, we can tweak the initial scores so that the most truly relevant passages end up at the top. The objective is to demonstrate how combining dense retrieval with graph-based diffusion re-ranking can yield superior retrieval performance, providing a more accurate and contextually relevant set of documents essential for applications requiring dynamic knowledge access.</p>
      <p>In recent years, advances in Large Language Models
(LLMs) and AI-driven dialog systems have enabled more
dynamic, retrieval-augmented conversational platforms.</p>
      <p>Retrieval-augmented pre-training was pioneered in the REALM framework [5], which demonstrated the effectiveness of retrieval-augmented language model pre-training by fine-tuning on open-domain question answering (Q&amp;A). At inference time, REALM fetches documents using dense embeddings and conditions the generator on retrieved passages. Building on this idea, Lewis et al. formalized the Retrieval-Augmented Generation (RAG) architecture [2], showing that coupling dense retrieval with a pretrained sequence-to-sequence model improves factual grounding and generalization in Q&amp;A.</p>
      <p>Unlike traditional LLMs that rely solely on parametric memory, RAG leverages a non-parametric index to fetch up-to-date, domain-specific information during generation.</p>
      <p>Prior to dense retrieval, sparse vector-space methods, such as TF-IDF or BM25, were the de facto standards for fetching relevant documents [6]. Although BM25 performs well on short, keyword-based queries, it struggles with semantic matching in open-domain contexts [7]. Karpukhin et al. [4] showed that a dual-encoder dense retrieval model, trained on relatively few question–passage pairs, could outperform a strong BM25 baseline. Subsequent work by Xiong et al. [8] and Qu et al. [9] confirmed that dense retrievers better handle paraphrased, abstract, and long-tail queries. These studies also highlighted challenges in dense retrieval, such as selecting hard negatives and mitigating false negatives, and proposed improvements in training objectives and negative sampling strategies.</p>
      <p>Despite these advancements, the top-k passages returned by a dense retriever may include semantically similar but contextually irrelevant documents. To address this, our work introduces a graph-based diffusion re-ranking step over the initial dense retrieval results. This idea is inspired by Donoser and Bischof's diffusion process for visual retrieval [10], where each document is treated as a node in a similarity graph and scores propagate through edges to refine the ranking. We adapt this diffusion-based re-ranking to text-based retrieval by constructing a graph over the top retrieved chunks and iteratively propagating similarity scores to emphasize manifold structure rather than relying solely on pairwise dot products.</p>
      <p>However, to our knowledge, prior RAG-style systems have not integrated graph-based diffusion re-ranking to refine their dense retrieval outputs. In this paper, we propose such an integration and demonstrate its effectiveness on benchmark Q&amp;A datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>This section details the design and implementation of our</title>
        <p>dense-retrieval chatbot. The system employs a
graphbased difusion re-ranking mechanism to enhance
re3.2. Corpus Construction and Indexing to web search, it fetches the top 20 results, forwards to
the LLM along with the conversation history and the
The knowledge base for Retrieval-Augmented Gener- LLM generates the response. If the query is directed to
ation (RAG) is derived from a collection of PDF and the RAG retriever, the dense retriever and page re-ranker
plain-text documents stored in a designated directory. comes into play which retrieves the relevant document
An ofline ingestion script (ingest_embeddings.py) pro- chunks from the vector database and forwards them to
cesses these sources into a searchable vector index. LLM for it to generate a response. A global flag can
overFirstly, PDF documents are converted to Markdown us- ride this logic and force any query to use the Web Search
ing the Docling library[17], with OCR enabled to extract path. When enabled, even queries that would normally
text from scanned pages. Plain-text files are read di- directed to the RAG Retriever or go straight to the LLM
rectly. The Markdown content is first segmented into are redirected to fetch live results via the SERP API. This
logical blocks (e.g., headings, paragraphs, table rows). ensures that all responses are grounded in the most
up-toThese blocks are then aggregated into chunks of up date information available. This is ideal for time-sensitive
to 500 words with a 50-word overlap between consec- domains like news, finance, or rapidly evolving technical
utive chunks. This overlap strategy ensures contex- ifelds.
tual continuity across chunk boundaries. Each text
chunk is embedded using the Hugging Face
implementation of openai/clip-vit-base-patch32 [18]. The 3.4. RAG Search: Dense Retrieval and
get_text_features() method produces a 512-dimensional Difusion Reranking
vector, which is then normalized to unit ℓ2 norm. The
resulting embedding vectors are indexed in the Milvus-Lite
rag_collection. Each entry includes the vector (emb) and
associated metadata: source_path, a unique chunk_id,
the full chunk_text, and a 200-character chunk_preview.</p>
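        <p>The chunking strategy described above can be sketched in a few lines of Python. This is a simplified sketch with our own function name; the actual ingest_embeddings.py may differ in details such as block handling:</p>

```python
def make_chunks(blocks, max_words=500, overlap=50):
    """Aggregate logical blocks (headings, paragraphs, table rows) into
    chunks of up to max_words words, carrying the last `overlap` words
    into the next chunk for contextual continuity."""
    chunks, current, fresh = [], [], 0
    for block in blocks:
        for word in block.split():
            current.append(word)
            fresh += 1
            if len(current) == max_words:
                chunks.append(" ".join(current))
                current = current[-overlap:]   # 50-word overlap
                fresh = 0
    if fresh:                                  # flush the remainder
        chunks.append(" ".join(current))
    return chunks
```

        <p>Each emitted chunk therefore repeats the last 50 words of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.</p>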
        <p>An IVF_FLAT index is built on the embedding field with
nlist = 128, partitioning the vector space to accelerate
searches. The entire index is loaded into memory for
high-speed nearest-neighbor lookups.</p>
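        <p>In pymilvus terms, the index and search configuration above corresponds to parameter dictionaries of the following shape (a sketch; the exact client calls depend on the pymilvus version used):</p>

```python
# Index built on the "emb" field: IVF_FLAT with 128 partitions (nlist),
# inner-product (IP) metric to match the l2-normalized CLIP embeddings.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",
    "params": {"nlist": 128},
}

# At query time, only nprobe of the 128 partitions are scanned.
search_params = {"metric_type": "IP", "params": {"nprobe": 10}}
```
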
        <sec id="sec-2-1-1">
          <title>3.3. Core Processing Pipeline</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Incoming user requests, whether text or speech, trigger</title>
        <p>a multi-stage process to generate a contextually relevant
response. The system acquires user input through two
primary endpoints: a speech input API that transcribes
audio files using a Whisper model [ 14], and a text input
API that accepts JSON payloads with the conversation
history [1]. Once the user's query is obtained, it undergoes intent classification by a RoBERTa-base model, fine-tuned by us with the Low-Rank Adaptation (LoRA) technique on a synthetic dataset we curated for training, which categorizes the text as "RAG Search", "Web Search", "Greeting", or "Conversation Meta".</p>
        <p>This classification model was optimized using Low-Rank Adapters with a rank of r = 8 and a scaling factor of α = 16, applied to the query and key projection matrices. Based on the resulting intent, the query is routed down one of three paths: "Greeting" and "Conversation Meta" intents bypass retrieval and generate a direct response from the role of the LLM and the conversation history, respectively; a "Web Search" classification triggers a web-augmented generation path; lastly, the "RAG Search" intent activates a dense retrieval and re-ranking pipeline for a RAG-augmented response. When the query is directed to web search, the system fetches the top 20 results and forwards them, along with the conversation history, to the LLM, which generates the response. If the query is directed to the RAG retriever, the dense retriever and the re-ranker come into play: they retrieve the relevant document chunks from the vector database and forward them to the LLM to generate a response. A global flag can override this logic and force any query to use the Web Search path. When enabled, even queries that would normally be directed to the RAG retriever, or go straight to the LLM, are redirected to fetch live results via SerpAPI. This ensures that all responses are grounded in the most up-to-date information available, which is ideal for time-sensitive domains like news, finance, or rapidly evolving technical fields.</p>
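        <p>The routing logic reduces to a small dispatch function. The following sketch uses our own names for the paths and mirrors the GLOBAL_SEARCH_MODE override described in Section 3.1:</p>

```python
def route_intent(intent, global_search_mode=False):
    """Map a classified intent to a processing path (illustrative sketch)."""
    if global_search_mode:
        return "web_search"      # override: bypass intent classification
    if intent in ("Greeting", "Conversation Meta"):
        return "direct_llm"      # answered from role definition + history
    if intent == "Web Search":
        return "web_search"      # SerpAPI-augmented generation
    if intent == "RAG Search":
        return "rag_search"      # dense retrieval + diffusion re-ranking
    raise ValueError("unknown intent: " + intent)
```
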
      </sec>
      <sec id="sec-2-3">
        <title>3.4. RAG Search: Dense Retrieval and Diffusion Reranking</title>
        <p>For queries that are classified as "RAG Search", requiring information from the internal knowledge base, the system executes a sophisticated retrieval and reranking process. During initial retrieval, the raw query is embedded using the openai/clip-vit-base-patch32 [18] model to produce a 512-dimensional query vector. This vector is used to search the Milvus collection for the top 50 most similar chunks based on inner-product similarity, with search parameter nprobe = 10. The top n (up to 50) candidate chunks are used to construct a weighted, undirected graph G = (V, E), where each node v ∈ V represents a candidate chunk. An edge (u, v) ∈ E is created for every pair of nodes, with its weight set to the cosine similarity between the respective embedding vectors. This results in a complete graph that captures the semantic manifold of the candidate set.</p>
        <p>To refine the initial ranking, we employ personalized PageRank (diffusion). A personalization vector p is constructed directly from the raw dense retrieval scores s of the n candidate chunks, where each component is proportional to the initial dense retrieval score of candidate i. Thus, p is neither empty nor randomly initialized: it is deterministically defined by normalizing the retrieval scores, ensuring that higher-scored chunks receive greater weight:</p>
        <p>pᵢ = sᵢ / ∑ⱼ₌₁ⁿ sⱼ, i = 1, …, n, so that p ≠ 0 and ∑ᵢ₌₁ⁿ pᵢ = 1. (1)</p>
        <p>This vector biases the random walk towards the candidates that were most relevant before the graph diffusion step is applied. The final PageRank scores π ∈ ℝⁿ are computed iteratively via the NetworkX library [19], solving the standard personalized PageRank equation π = α W π + (1 − α) p, where W is the column-stochastic transition matrix obtained by normalizing the graph's edge weights and α is the damping factor.</p>
      </sec>
      <sec id="sec-2-4">
        <title>For RAG Search, the formatted top-20 re-ranked chunks are appended under a Retrieved Context: heading</title>
        <p>{ "text": Role + conversation history, "rawQuery": User Query, "skipApiKeyValidation": false, "Retrieved Context": Top 20 chunks }</p>
      </sec>
      <sec id="sec-2-5">
        <title>For Web Search, the serialized JSON from SerpAPI is appended under a Web Search Results: heading in the JSON file.</title>
        <p>The damping factor is set to α = 0.85 [9], striking a balance between "walking" the similarity graph (propagating scores along edges) and "teleporting" back to the seed nodes (initial retrieval scores) to avoid getting stuck in tight clusters. Values much higher (&gt;0.9) can slow convergence and over-emphasize dense subgraphs; values much lower (&lt;0.7) behave more like pure retrieval without graph smoothing. This diffusion process up-ranks candidates that belong to dense, semantically coherent clusters within the graph, mitigating the risk of relying on isolated high-similarity outliers. For context formulation, the candidates are sorted by their final PageRank scores in descending order, and the top K = 20 chunks are selected to form the Retrieved Context. If the re-ranking step is disabled or fails, the system falls back to the top 20 candidates from the initial dense retrieval. The top 20 candidates are selected because, through trial and error, we found that passing 20 candidates to the LLM is sufficient to cover enough potential context that the relevant bits are not lost, while avoiding too many off-topic chunks that would dilute the diffusion signal. A 20-node graph is also small enough for sub-100 ms diffusion passes, keeping end-to-end latency low. If the collection is huge, increasing the number of top candidates passed to the LLM would be recommended.</p>
        <sec id="sec-2-5-1">
          <title>3.5. Web Search Augmentation</title>
          <p>For queries with the Web Search intent, the system
queries the Google Search engine via the SerpAPI. The
query retrieves the "answer box" and up to 20 top organic
results. The structured JSON response from the API is
serialized into a string. If the API call fails, the process
continues without web context.</p>
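          <p>A minimal sketch of the web-search call follows. The helper names are ours; engine, q, num, and api_key are standard SerpAPI request parameters for the Google engine:</p>

```python
import json

def build_serp_params(query, api_key, num=20):
    """Request parameters for SerpAPI's Google engine."""
    return {"engine": "google", "q": query, "num": num, "api_key": api_key}

def serialize_web_context(response_json):
    """Keep the answer box and up to 20 organic results as one string;
    on an API failure the caller simply proceeds without web context."""
    return json.dumps({
        "answer_box": response_json.get("answer_box"),
        "organic_results": response_json.get("organic_results", [])[:20],
    })
```
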
        </sec>
        <sec id="sec-2-5-2">
          <title>3.6. LLM Prompting and Response</title>
        </sec>
        <sec id="sec-2-5-3">
          <title>Generation</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>All prompts are submitted to the gemini-2.5-pro</title>
        <p>model [12]. The final prompt is dynamically assembled
based on the routing path: Every prompt begins with a
ifxed role definition and the current conversation history.</p>
      </sec>
      <sec id="sec-2-7">
        <title>For example the payload JSON file would look like as</title>
        <p>follows. In place of Role we would define the role of the</p>
      </sec>
      <sec id="sec-2-8">
        <title>LLM to give it a persona and in place of conversation history we would have the conversations between the User and the LLM.</title>
        <p>"text": Role + conversation history,
"rawQuery": User Query,
"skipApiKeyValidation": false
"Web Search Results": Top 20 search
results
number of top candidates to pass to the LLM would be tional context is added. The final composite prompt is
recommended.</p>
      </sec>
      <sec id="sec-2-9">
        <title>For Greeting and Conversation Meta intents, no addisent to the Gemini API. The extracted text from the response is returned to the client in a JSON object containing the reply and the original intent.</title>
      </sec>
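      <p>Assembling the composite prompt then amounts to merging the base payload with the path-specific context. This sketch uses the field names from the payload examples above; the exact structure sent to the Gemini client may differ:</p>

```python
def build_payload(role, history, query, retrieved=None, web_results=None):
    """Compose the final prompt payload for one routing path."""
    payload = {
        "text": role + "\n" + "\n".join(history),  # role definition + history
        "rawQuery": query,
        "skipApiKeyValidation": False,
    }
    if retrieved is not None:       # RAG Search: top-20 re-ranked chunks
        payload["Retrieved Context"] = retrieved
    if web_results is not None:     # Web Search: serialized SerpAPI JSON
        payload["Web Search Results"] = web_results
    return payload
```
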
    </sec>
    <sec id="sec-3">
      <title>4. Experiment</title>
      <sec id="sec-3-1">
        <title>To evaluate the eficacy of our proposed chatbot,</title>
        <p>particularly the contribution of graph-based difusion
reranking, we designed a series of experiments. Our
evaluation aims to answer three primary research
questions:
RQ1: Component Eficacy:</p>
      </sec>
      <sec id="sec-3-2">
        <title>How accurately does the intent classification module route user queries to the appropriate processing pipeline?</title>
        <p>RQ2: Retrieval Efectiveness:
graph-based difusion reranking significantly improve
the quality of retrieved documents compared to standard
dense retrieval baselines?
RQ3: End-to-End Performance: Does the enhanced
retrieval quality from our reranking module translate
into more accurate, faithful, and helpful final responses
generated by the LLM?</p>
      </sec>
      <sec id="sec-3-3">
        <title>Does the proposed</title>
      </sec>
      <sec id="sec-3-4">
        <title>This section details the experimental setup, the datasets used, the baselines for comparison, the evaluation metrics, and a thorough analysis of the results.</title>
        <sec id="sec-3-4-1">
          <title>4.1. Experimental Setup</title>
          <p>4.1.1. Dataset Construction</p>
          <p>We followed a similar procedure to generate the training dataset for the intent classifier (a hybrid dataset; some elements were taken from a publicly available dataset [26]) and its evaluation dataset. The scripts for generating the evaluation and training datasets can be found in the publicly available GitHub repository.</p>
          <p>To perform a realistic evaluation, we constructed a domain-specific question-answering dataset tailored to the Otto von Guericke University (OVGU) context in the English language; the same process can be followed for any other application domain. This reflects a practical application scenario where students frequently seek quick, reliable answers to academic queries, such as course details, procedures, or administrative processes, which are typically spread across the university website and official documents. The dataset was created as follows.</p>
          <p>We generated a retrieval-augmented question–answer (QA) dataset directly from our institutional PDF regulations and module handbooks using an end-to-end open-source pipeline. First, all PDF files were loaded via LangChain's PyPDFLoader [20] and split into overlapping text chunks (1000 characters, with 200 characters of overlap) with CharacterTextSplitter [21]. Each chunk was encoded into a FAISS vector store [22] using sentence-transformer embeddings (all-MiniLM-L6-v2) [23]. To produce questions, we initialized a local causal LLM (Llama-2-7B via Hugging Face's text-generation pipeline) wrapped by LangChain's HuggingFacePipeline [24]; however, any other embedding strategy or LLM could be used [25]. A few-shot prompt ("Given the following excerpt, generate n unique questions answerable from this content") was applied to each chunk (n = 2). Generated questions were de-duplicated in a case-insensitive manner, yielding a pool of 80 unique questions. For each question, we ran a retrieval-augmented QA chain: the FAISS retriever returned the top k = 4 most relevant chunks, and the LLM instantiated a "stuff"-type chain to produce concise answers, each appended with inline citations pointing to the source document chunk. All QA pairs were compiled into a final CSV (question, answer) named RAG_evaluation_Dataset.csv, resulting in 80 high-quality, syllabus-grounded items. Our fully local workflow relies exclusively on open-source models (sentence-transformers for embeddings; a Hugging Face model for the LLM) and FAISS for vector retrieval, ensuring reproducibility and data privacy. All hyperparameters (chunk size, overlap, k, temperature = 0.3, max_new_tokens = 512) are documented in our publicly available script. The resulting CSV file was manually verified, and typos were then introduced into the questions to add noise and simulate real-world queries. We also manually rechecked the answers against the source documents. The dataset can be found in our GitHub repository.</p>
          <p>4.1.2. Implementation Details: All experiments were conducted on a single machine equipped with a Ryzen 7 7800H CPU, an NVIDIA RTX 4060 GPU with 8 GB VRAM, and 16 GB of RAM. The system implementation uses the library versions specified in the requirements.txt file. The key hyperparameters for the RAG pipeline, including α = 0.85 for PageRank and K = 20 for the number of retrieved chunks, were kept constant across all experiments.</p>
          <p>4.2. Intent Classification Fine-Tuning: We fine-tuned RoBERTa-base for intent classification using a parameter-efficient LoRA setup. Our pipeline comprises dataset preparation, LoRA integration, training, and evaluation. A CSV dataset of user queries (Question) and intent labels (Label) was loaded, label-encoded, and split 80/20 (seed 42) in a stratified fashion. Queries were tokenized with RoBERTa's tokenizer (max length 64), producing input_ids and attention_mask fields wrapped in Hugging Face Dataset objects. We loaded roberta-base configured for four intents and applied LoRA adapters (PEFT) to the query and key projections with rank r = 8, α = 16, and dropout p = 0.05, freezing all other model weights. Fine-tuning ran on GPU (or CPU) with seed 42. We used Hugging Face's Trainer with the hyperparameters listed in Table 1 and an accuracy metric (scikit-learn). The best checkpoint (by validation accuracy) was evaluated on the held-out split. For inference, inputs are tokenized to length 64, passed through the model, and predicted indices are mapped back to label strings. This LoRA-based approach updates a small fraction of parameters, yielding fast convergence and lightweight deployment.</p>
          <p>Table 1: Fine-tuning hyperparameters for intent classification. Learning rate: 2 × 10⁻⁵; batch size: 8; weight decay: 0.01; epochs: 3; evaluation strategy: end of each epoch; checkpoint retention: last two checkpoints; selection criterion: best validation accuracy; logging frequency: every 50 steps.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>4.3. Baselines and System Variants</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>To answer RQ2, we evaluated the final generated re</title>
        <p>sponses. Automatic Metrics like ROUGE-L (Measures
n-gram overlap with the reference answer, focusing on
recall.), BERTScore (Computes the semantic similarity
between the generated response and the reference answer
using contextual embeddings.) have been considered.</p>
      </sec>
      <sec id="sec-3-6">
      <title>We compared our full system against several baselines</title>
      <p>and ablations to isolate the impact of our contributions.</p>
      <p>Lexical Baseline (BM25): a classic sparse retrieval system using TF-IDF with the Okapi BM25 algorithm. This represents a traditional, non-neural IR baseline. Dense Retrieval Baseline (Dense-NoRerank): this system uses the same CLIP-based query embedding and Milvus index as our proposed method but omits the graph reranking step. It simply takes the top-K results based on raw inner-product similarity. This serves as our primary ablation to directly measure the impact of diffusion reranking. Proposed System (Dense-Rerank): our full RAG pipeline as described in Section 3, which includes initial dense retrieval followed by graph-based diffusion reranking. For end-to-end evaluation, the retrieved context from each of these three systems is fed into the same Gemini-2.5-pro model [12] using an identical prompt structure.</p>
      <p>4.5. Results and Analysis</p>
      <p>4.5.1. Intent Classification Performance (RQ1)</p>
      </sec>
      <sec id="sec-3-7">
        <title>We finetuned the model for three full epochs using a</title>
        <p>linear learning-rate schedule from 2 × 10⁻⁵ down to 0. Figures 2–5 summarize key training diagnostics. The details of the model fine-tuning are given in Section 4.2.</p>
        <p>(Figures: Training Loss vs. Epoch; Evaluation Accuracy vs. Epoch.)</p>
        <p>4.4.2. Retrieval Performance</p>
      </sec>
      <sec id="sec-3-8">
      <title>To answer RQ2, we evaluated the quality of the ranked list of documents returned by each retrieval system against the annotated ground-truth chunks.</title>
        <p>Normalized Discounted Cumulative Gain
(nDCG@K): Measures the quality of the ranking,
rewarding systems that place highly relevant documents
at the top of the list. We reported nDCG@5, nDCG@10,
and nDCG@20.</p>
        <p>Mean Reciprocal Rank (MRR): Measures the average
reciprocal rank of the first relevant document. It is
particularly sensitive to how high the very first correct
answer is ranked.</p>
        <p>Recall@K: The proportion of relevant documents found in the top K results.</p>
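        <p>These are standard ranking metrics; for reference, minimal implementations over per-query binary relevance lists (the function names are ours) look like:</p>

```python
import numpy as np

def dcg_at_k(rels, k):
    """Discounted cumulative gain of a ranked relevance list, cut at k."""
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum(rels / np.log2(np.arange(2, rels.size + 2))))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_runs):
    """Mean reciprocal rank of the first relevant item over all queries."""
    recip = []
    for rels in ranked_runs:
        hits = [i for i, r in enumerate(rels) if r > 0]
        recip.append(1.0 / (hits[0] + 1) if hits else 0.0)
    return float(np.mean(recip))

def recall_at_k(rels, total_relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for r in rels[:k] if r > 0) / total_relevant
```
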
        <sec id="sec-3-8-1">
          <title>4.4. Evaluation Metrics</title>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <title>We employed automatic evaluation metrics to assess performance at different stages of the pipeline.</title>
        <p>4.4.1. Intent Classification: We evaluated the LoRA-tuned RoBERTa classifier using standard metrics on a held-out test set from our annotated dataset. Accuracy: the overall percentage of correctly classified intents. Macro-F1 Score: the unweighted mean of the F1-scores for each of the four intent classes, providing a balanced measure of performance.</p>
        <p>Figure 2: Training loss as a function of epoch. Loss fell precipitously from 1.38 to 0.05 within the first half-epoch and then decayed asymptotically toward zero by epoch 3.</p>
        <p>(Figure 4: Learning Rate vs. Epoch.)</p>
        <p>Each classification class, such as RAG Search, Web Search, Greeting, and Conversation_Meta, has 4K rows. This synthetic dataset is used to evaluate the model right after fine-tuning. A custom-made external dataset of real-world queries, constructed with the procedure described in 4.1.1, is used to compute a confusion matrix in addition to the one generated from the synthetic dataset. The external dataset contains the kinds of typos generally seen in real-world usage.</p>
        <p>Table 2 reports Accuracy and Macro-F1 on the held-out portion of our annotated dataset. Tables 3 and 4 show the corresponding 4×4 confusion matrices (true ∖ predicted).</p>
      </sec>
      <sec id="sec-3-10">
        <title>After fine-tuning the model, we tested it in two ways, using the previously discussed synthetic dataset, which has 16K rows, where each classification</title>
      </sec>
      <sec id="sec-3-11">
        <title>On the annotated dataset, we achieve 99.88% Accuracy</title>
        <p>and Macro-F1, with only four misclassifications (all in the "greeting" intent). On the external test dataset, we observe perfect scores with no off-diagonal errors. These results indicate that our LoRA-tuned RoBERTa model is highly reliable for routing user utterances to their correct intents under both in-domain and held-out conditions.</p>
        <p>The rapid decline in training loss (Fig. 2) demonstrates that the model quickly learns low-level patterns. Evaluation accuracy (Fig. 3) increases monotonically from 99.875% to 99.96875%, while evaluation loss falls from 0.00416 to 0.00123, indicating continued but diminishing generalization improvements across epochs. The learning-rate schedule (Fig. 4) balances coarse early updates and fine-tuning in later epochs, and the gradient norms (Fig. 5) confirm that the optimizer transitions smoothly from high-magnitude updates to stable, small magnitudes without oscillation or divergence. Overall, these results validate our choice of schedule and training regime, showing strong convergence with minimal overfitting.</p>
        <p>4.5.2. Retrieval Effectiveness (RQ2)</p>
      </sec>
      <sec id="sec-3-12">
        <title>4.5.2. Retrieval Effectiveness (RQ2)</title>
        <p>We tested the retriever with the custom dataset
discussed earlier in 4.1.1.</p>
        <p>The diffusion-based reranking step yields a
substantial lift over plain dense retrieval. Early-rank gains:
nDCG@5 increases from 0.82 to 0.90 (+9.8%), and MRR
from 0.88 to 0.95 (+8.0%), showing that the first
relevant chunk is more consistently ranked at the very top.</p>
        <p>Broader coverage: Recall@5 improves from 0.90 to 0.94,
indicating that nearly all of the relevant passages are
captured within the top 5 results.</p>
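        <p>For reference, the three retrieval metrics reported here (MRR, Recall@5, and binary-relevance nDCG@5) can be sketched as follows; the function names and the toy ranking are illustrative, not the paper's evaluation harness:</p>
        <preformat>
```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant chunk per query."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k=5):
    """Average fraction of each query's relevant chunks in the top k."""
    return sum(len(set(r[:k]) & rel) / len(rel)
               for r, rel in zip(ranked_lists, relevant_sets)) / len(ranked_lists)

def ndcg_at_k(ranked_lists, relevant_sets, k=5):
    """Binary-relevance nDCG@k averaged over queries."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        dcg = sum(1.0 / math.log2(i + 1)
                  for i, doc in enumerate(ranked[:k], start=1) if doc in rel)
        idcg = sum(1.0 / math.log2(i + 1)
                   for i in range(1, min(len(rel), k) + 1))
        total += dcg / idcg
    return total / len(ranked_lists)

# One query: the two relevant chunks appear at ranks 2 and 4.
ranked = [["c3", "c1", "c7", "c2", "c9"]]
relevant = [{"c1", "c2"}]
```
        </preformat>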
        <p>Table: Retrieval metrics (MRR) for BM25, Dense Retrieval
(no rerank), and + Diffusion Re-Ranking.</p>
        <p>This improvement stems from the Personalized PageRank
diffusion over the dense-embedding graph. Cluster
promotion: semantically coherent clusters of chunks
mutually reinforce each other, raising their rank. Noise
suppression: isolated or tangential hits receive little diffusion
signal and therefore drop down the list. As a result,
Dense-Rerank not only boosts the presence of highly relevant
documents at top positions (driving up nDCG and
MRR) but also enhances overall recall within the critical
early ranks.</p>
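        <p>A minimal sketch of this reranking step, assuming cosine-similarity edges, a teleport vector proportional to query-chunk similarity, and the α = 0.85 damping used in the paper (all function and variable names here are illustrative, not the paper's actual implementation):</p>
        <preformat>
```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def diffusion_rerank(query_emb, chunk_embs, alpha=0.85, iters=50):
    """Rerank candidates by personalized PageRank over a
    cosine-similarity graph; teleport mass follows query similarity."""
    n = len(chunk_embs)
    # Edge weights: pairwise similarity (negatives clipped), no self-loops.
    W = [[max(cosine(chunk_embs[i], chunk_embs[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Row-normalize into a transition matrix.
    P = []
    for row in W:
        s = sum(row)
        P.append([w / s for w in row] if s else [1.0 / n] * n)
    # Personalization vector from query-chunk similarity.
    q = [max(cosine(query_emb, e), 0.0) for e in chunk_embs]
    qs = sum(q) or 1.0
    q = [x / qs for x in q]
    # Power iteration: r <- (1 - alpha) * q + alpha * (r @ P)
    r = q[:]
    for _ in range(iters):
        r = [(1 - alpha) * q[j] + alpha * sum(r[i] * P[i][j] for i in range(n))
             for j in range(n)]
    return sorted(range(n), key=lambda i: r[i], reverse=True)

# Toy example: three mutually similar chunks near the query plus one
# tangential chunk; the isolated chunk receives little diffusion mass.
query = [1.0, 0.0]
chunks = [[1.0, 0.05], [0.95, 0.1], [0.9, 0.02], [0.0, 1.0]]
order = diffusion_rerank(query, chunks)
```
        </preformat>
        <p>Here the cluster promotion and noise suppression described above fall out of the same power iteration: the coherent chunks exchange diffusion mass while the orthogonal chunk is pushed to the bottom of the ranking.</p>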
      </sec>
      <sec id="sec-3-13">
        <title>4.5.3. End-to-End Generation Quality (RQ3)</title>
        <p>The improvements in automatic metrics mirror our
retrieval findings (RQ2). Faithfulness and helpfulness:
higher ROUGE-L and BERTScore for Dense-Rerank
indicate more accurate and relevant content generation,
thanks to the superior top-K retrieval. Retrieval →
generation link: RQ2 showed that diffusion reranking
promotes centrally relevant chunks; RQ3 demonstrates
that feeding those higher-quality chunks into the LLM
yields outputs that better match reference texts (ROUGE-L)
and show higher semantic overlap (BERTScore). Fluency:
we observed similar fluency across all three systems (not
shown), as fluency is primarily governed by the pre-trained
LLM rather than the retrieval backend. Thus, the
end-to-end generation-quality gains can be directly
attributed to the gains in retrieval effectiveness.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption><p>Automatic generation-quality metrics for end-to-end RAG systems.</p></caption>
          <table>
            <thead><tr><th>Method</th><th>ROUGE-L</th><th>BERTScore</th></tr></thead>
            <tbody>
              <tr><td>BM25</td><td>0.46</td><td>0.68</td></tr>
              <tr><td>Dense Retrieval (no rerank)</td><td>0.74</td><td>0.79</td></tr>
              <tr><td>+ Diffusion Re-Ranking</td><td>0.84</td><td>0.86</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-13-1">
          <title>5. Conclusion</title>
          <p>We presented Diffusion-Aided RAG, a novel pipeline
that couples dense retrieval with graph-based diffusion
reranking to improve the precision and contextual
coherence of Retrieval-Augmented Generation systems. By
constructing a semantic similarity graph over the top-candidate
chunks and applying a personalized PageRank diffusion,
our method consistently boosts early-rank retrieval
metrics (nDCG@5, MRR) and broad recall (R@20),
translating directly into higher ROUGE-L and BERTScore
on end-to-end QA generation. The framework is efficient
enough for real-time applications, relies on open-source
components (Milvus, CLIP, Gemini), and demonstrates
robustness across both synthetic and external query sets.</p>
        </sec>
        <sec id="sec-3-13-2">
          <title>5.1. Limitations</title>
          <p>The current Diffusion-Aided RAG framework, while
demonstrating significant improvements in retrieval
effectiveness and generation quality, exhibits several
critical limitations that warrant careful consideration for
broader deployment and cross-linguistic applications.
The most pronounced limitation concerns hyperparameter
sensitivity, particularly regarding the damping factor
α = 0.85 employed in the personalized PageRank diffusion
process. This parameter, borrowed from the canonical
PageRank algorithm, was empirically validated on the
OVGU academic dataset but may exhibit suboptimal
performance across different domains or linguistic
contexts. The choice of K = 20 candidate chunks for final
context formation, while computationally efficient for
maintaining sub-100ms response times, represents
another domain-specific optimization that lacks theoretical
grounding for universal applicability.</p>
          <p>The system’s architectural dependencies introduce
additional constraints that become particularly
problematic when considering cross-linguistic adaptation.
The reliance on the openai/clip-vit-base-patch32
embedding model, which produces 512-dimensional vectors
optimized primarily for English text, creates a
fundamental bottleneck for multilingual applications. This
model’s training corpus exhibited limited exposure to
non-English languages, potentially compromising
semantic representation quality for languages with different
morphological complexity, syntactic structures, or
cultural contexts. The IVF_FLAT index configuration with
nlist=128 in the Milvus-Lite vector store, while adequate
for the current academic dataset, may require significant
recalibration for larger or more diverse document
collections. The intent classification module, despite achieving</p>
        </sec>
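        <p>For concreteness, the IVF_FLAT/nlist=128 configuration mentioned above can be written as a pymilvus-style parameter fragment; the metric type and the nprobe value are our assumptions for illustration, not taken from the paper:</p>
        <preformat>
```python
# Sketch of the IVF_FLAT index parameters described in the text.
# nlist=128 partitions the 512-dimensional CLIP vectors into 128
# coarse clusters; at query time only `nprobe` clusters are scanned.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",   # assumption: cosine over CLIP embeddings
    "params": {"nlist": 128},
}
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16},  # illustrative value, not from the paper
}

# Against a live Milvus collection this would be applied roughly as:
# collection.create_index(field_name="embedding", index_params=index_params)
```
        </preformat>
        <p>Recalibrating for larger collections would mainly mean raising nlist (more, smaller clusters) and tuning nprobe to trade recall against latency.</p>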
        <sec id="sec-3-14">
          <title>6. Acknowledgments</title>
          <p>This research is partially funded by the PNRR project FAIR -
Future AI Research (PE00000013), Spoke 6 - Symbiotic AI.</p>
        </sec>
        <sec id="sec-3-15">
          <title>A. Online Resources</title>
          <p>The source code for the overall implementation of our
project can be accessed through our GitHub repository.</p>
          <p>• GitHub</p>
        </sec>
        <sec id="sec-3-16">
          <title>Declaration on Generative AI</title>
          <p>During the preparation of this work, the author(s) used
ChatGPT (OpenAI), Gemini (Google), and Grammarly in order to:
paraphrase and reword, improve writing style, and check grammar
and spelling. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>