<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Diffusion-Aided RAG: Elevating Dense-Retrieval Chatbots via Graph-Based Diffusion Reranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sai Teja Dampanaboina</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Nishchal Gamini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karishma Kunwar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Polignano</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Levantesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto William De Luca</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>George Eckert Institute</institution>
          ,
          <addr-line>Brunswick</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Institute for Educational Media</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Otto-von-Guericke University</institution>
          ,
          <addr-line>Universitätspl. 2, 39106 Magdeburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Bari Aldo Moro</institution>
          ,
          <addr-line>via E. Orabona 4, 70125, Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a comprehensive framework for enhancing dense-retrieval-based chatbots through the integration of graph-based diffusion reranking. Addressing challenges in traditional retrieval-augmented generation (RAG) systems, the proposed methodology incorporates a multi-step pipeline that advances document retrieval and relevance ranking. Initially, candidate passages are retrieved via dense embeddings, followed by the construction of a graph representation that captures inter-passage semantic relationships. Through a graph-based diffusion process, the reranking mechanism refines the selection, amplifying clusters of contextually relevant documents while mitigating noise effects from irrelevant data points. Experimental results demonstrate significant gains in retrieval quality and question-answering accuracy, underscoring the framework's potential for knowledge-intensive real-time applications such as conversational AI. This work reflects a pivotal step towards developing highly accurate, dynamic, and scalable multimodal conversational systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Chatbots</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>PageRank</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Advanced chatbots and other modern NLP tools need fast access to up-to-date, specific information. Although Large Language Models (LLMs) can generate fluent responses and handle a wide range of topics, they are stuck with whatever they learned during training, and their knowledge can become outdated or be too general [1]. RAG [2] solves this by enabling the LLM to retrieve information from an external database that can be updated in real time. This strategic decoupling of the LLM's generative function from data management, including storage, indexing, and, crucially, retrieval, allows for continuous knowledge updates, thereby enhancing the responsiveness, reliability, and domain fidelity of such systems. Being able to quickly and accurately find the right information from a variety of sources is essential for powering these next-generation NLP systems.</p>
      <p>Information Retrieval (IR) has evolved considerably to help us find relevant information more quickly and accurately in huge collections of text [3]. Instead of just matching words on the page, many modern systems use dense representations: numeric embeddings that capture the meaning of queries and documents. This method, called dense passage retrieval [4], makes it possible to find passages that are related in meaning even if they do not share the same exact words. Still, pulling back the single best set of passages from an enormous database is tough, and the first batch of results often requires additional refinement to make sure it is really on point. That is why it is common to run additional steps like re-ranking to fine-tune and improve the final selection.</p>
      <p>We aim to make dense passage retrieval work even better by using a multi-step pipeline. First, we pull an initial batch of candidate passages with a dense retriever. Then we turn those top documents into a graph and run a diffusion process over it. This lets us capture how the passages relate to each other. By using this graph-based diffusion as a re-ranker, we can tweak the initial scores so that the most truly relevant passages end up at the top. The objective is to demonstrate how combining dense retrieval with graph-based diffusion re-ranking can yield superior retrieval performance, providing a more accurate and contextually relevant set of documents essential for applications requiring dynamic knowledge access.</p>
      <p>In recent years, advances in Large Language Models
(LLMs) and AI-driven dialog systems have enabled more
dynamic, retrieval-augmented conversational platforms.</p>
      <p>Retrieval-augmented pre-training was pioneered in the REALM framework [5], which demonstrated the effectiveness of retrieval-augmented language model pre-training by fine-tuning on open-domain question answering (Q&amp;A). At inference time, REALM fetches documents using dense embeddings and conditions the generator on retrieved passages. Building on this idea, Lewis et al. formalized the Retrieval-Augmented Generation (RAG) architecture [2], showing that coupling dense retrieval with a pretrained sequence-to-sequence model improves factual grounding and generalization in Q&amp;A.</p>
      <p>Unlike traditional LLMs that rely solely on parametric memory, RAG leverages a non-parametric index to fetch up-to-date, domain-specific information during generation.</p>
      <p>Prior to dense retrieval, sparse vector-space methods, such as TF-IDF or BM25, were the de facto standards for fetching relevant documents [6]. Although BM25 performs well on short, keyword-based queries, it struggles with semantic matching in open-domain contexts [7]. Karpukhin et al. [4] showed that a dual-encoder dense retrieval model, trained on relatively few question–passage pairs, could outperform a strong BM25 baseline. Subsequent work by Xiong et al. [8] and Qu et al. [9] confirmed that dense retrievers better handle paraphrased, abstract, and long-tail queries. These studies also highlighted challenges in dense retrieval, such as selecting hard negatives and mitigating false negatives, and proposed improvements in training objectives and negative sampling strategies.</p>
      <p>Despite these advancements, the top-k passages returned by a dense retriever may include semantically similar but contextually irrelevant documents. To address this, our work introduces a graph-based diffusion re-ranking step over the initial dense retrieval results. This idea is inspired by Donoser and Bischof's diffusion process for visual retrieval [10], where each document is treated as a node in a similarity graph and scores propagate through edges to refine the ranking. We adapt this diffusion-based re-ranking to text-based retrieval by constructing a graph over the top retrieved chunks and iteratively propagating similarity scores to emphasize manifold structure rather than relying solely on pairwise dot products.</p>
      <p>However, to our knowledge, prior RAG-style systems have not integrated graph-based diffusion re-ranking to refine their dense retrieval outputs. In this paper, we propose such an integration and demonstrate its effectiveness on benchmark Q&amp;A datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <title>This section details the design and implementation of our</title>
        <p>dense-retrieval chatbot. The system employs a
graphbased difusion re-ranking mechanism to enhance
re3.2. Corpus Construction and Indexing to web search, it fetches the top 20 results, forwards to
the LLM along with the conversation history and the
The knowledge base for Retrieval-Augmented Gener- LLM generates the response. If the query is directed to
ation (RAG) is derived from a collection of PDF and the RAG retriever, the dense retriever and page re-ranker
plain-text documents stored in a designated directory. comes into play which retrieves the relevant document
An ofline ingestion script (ingest_embeddings.py) pro- chunks from the vector database and forwards them to
cesses these sources into a searchable vector index. LLM for it to generate a response. A global flag can
overFirstly, PDF documents are converted to Markdown us- ride this logic and force any query to use the Web Search
ing the Docling library[17], with OCR enabled to extract path. When enabled, even queries that would normally
text from scanned pages. Plain-text files are read di- directed to the RAG Retriever or go straight to the LLM
rectly. The Markdown content is first segmented into are redirected to fetch live results via the SERP API. This
logical blocks (e.g., headings, paragraphs, table rows). ensures that all responses are grounded in the most
up-toThese blocks are then aggregated into chunks of up date information available. This is ideal for time-sensitive
to 500 words with a 50-word overlap between consec- domains like news, finance, or rapidly evolving technical
utive chunks. This overlap strategy ensures contex- ifelds.
tual continuity across chunk boundaries. Each text
chunk is embedded using the Hugging Face
implementation of openai/clip-vit-base-patch32 [18]. The 3.4. RAG Search: Dense Retrieval and
get_text_features() method produces a 512-dimensional Difusion Reranking
vector, which is then normalized to unit ℓ2 norm. The
resulting embedding vectors are indexed in the Milvus-Lite
rag_collection. Each entry includes the vector (emb) and
associated metadata: source_path, a unique chunk_id,
the full chunk_text, and a 200-character chunk_preview.</p>
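        <p>The chunking strategy described above can be sketched in a few lines of Python. This is a simplified sketch with our own function name; the actual ingest_embeddings.py may differ in details such as block handling:</p>

```python
def make_chunks(blocks, max_words=500, overlap=50):
    """Aggregate logical blocks (headings, paragraphs, table rows) into
    chunks of up to max_words words, carrying the last `overlap` words
    into the next chunk for contextual continuity."""
    chunks, current, fresh = [], [], 0
    for block in blocks:
        for word in block.split():
            current.append(word)
            fresh += 1
            if len(current) == max_words:
                chunks.append(" ".join(current))
                current = current[-overlap:]   # 50-word overlap
                fresh = 0
    if fresh:                                  # flush the remainder
        chunks.append(" ".join(current))
    return chunks
```

        <p>Each emitted chunk therefore repeats the last 50 words of its predecessor, so a sentence cut at a boundary still appears whole in at least one chunk.</p>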
        <p>An IVF_FLAT index is built on the embedding field with
nlist = 128, partitioning the vector space to accelerate
searches. The entire index is loaded into memory for
high-speed nearest-neighbor lookups.</p>
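        <p>In pymilvus terms, the index and search configuration above corresponds to parameter dictionaries of the following shape (a sketch; the exact client calls depend on the pymilvus version used):</p>

```python
# Index built on the "emb" field: IVF_FLAT with 128 partitions (nlist),
# inner-product (IP) metric to match the l2-normalized CLIP embeddings.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",
    "params": {"nlist": 128},
}

# At query time, only nprobe of the 128 partitions are scanned.
search_params = {"metric_type": "IP", "params": {"nprobe": 10}}
```
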
        <sec id="sec-2-1-1">
          <title>3.3. Core Processing Pipeline</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Incoming user requests, whether text or speech, trigger</title>
        <p>a multi-stage process to generate a contextually relevant
response. The system acquires user input through two
primary endpoints: a speech input API that transcribes
audio files using a Whisper model [ 14], and a text input
API that accepts JSON payloads with the conversation
history [1]. Once the user's query is obtained, it undergoes intent classification by a RoBERTa-base model, fine-tuned by us with the Low-Rank Adaptation (LoRA) technique on a synthetic dataset we curated for training, which categorizes the text as "RAG Search", "Web Search", "Greeting", or "Conversation Meta".</p>
        <p>This classification model was optimized using Low-Rank Adapters with a rank of r = 8 and a scaling factor of α = 16, applied to the query and key projection matrices. Based on the resulting intent, the query is routed down one of three paths: "Greeting" and "Conversation Meta" intents bypass retrieval and generate a direct response from the role of the LLM and the conversation history, respectively; a "Web Search" classification triggers a web-augmented generation path; lastly, the "RAG Search" intent activates a dense retrieval and re-ranking pipeline for a RAG-augmented response. When the query is directed to web search, the system fetches the top 20 results and forwards them, along with the conversation history, to the LLM, which generates the response. If the query is directed to the RAG retriever, the dense retriever and the re-ranker come into play: they retrieve the relevant document chunks from the vector database and forward them to the LLM to generate a response. A global flag can override this logic and force any query to use the Web Search path. When enabled, even queries that would normally be directed to the RAG retriever, or go straight to the LLM, are redirected to fetch live results via SerpAPI. This ensures that all responses are grounded in the most up-to-date information available, which is ideal for time-sensitive domains like news, finance, or rapidly evolving technical fields.</p>
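        <p>The routing logic reduces to a small dispatch function. The following sketch uses our own names for the paths and mirrors the GLOBAL_SEARCH_MODE override described in Section 3.1:</p>

```python
def route_intent(intent, global_search_mode=False):
    """Map a classified intent to a processing path (illustrative sketch)."""
    if global_search_mode:
        return "web_search"      # override: bypass intent classification
    if intent in ("Greeting", "Conversation Meta"):
        return "direct_llm"      # answered from role definition + history
    if intent == "Web Search":
        return "web_search"      # SerpAPI-augmented generation
    if intent == "RAG Search":
        return "rag_search"      # dense retrieval + diffusion re-ranking
    raise ValueError("unknown intent: " + intent)
```
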
      </sec>
      <sec id="sec-2-3">
        <title>3.4. RAG Search: Dense Retrieval and Diffusion Reranking</title>
        <p>For queries that are classified as "RAG Search", requiring information from the internal knowledge base, the system executes a sophisticated retrieval and reranking process. During initial retrieval, the raw query is embedded using the openai/clip-vit-base-patch32 [18] model to produce a 512-dimensional query vector. This vector is used to search the Milvus collection for the top 50 most similar chunks based on inner-product similarity, with search parameter nprobe = 10. The top n (up to 50) candidate chunks are used to construct a weighted, undirected graph G = (V, E), where each node v ∈ V represents a candidate chunk. An edge (u, v) ∈ E is created for every pair of nodes, with its weight set to the cosine similarity between the respective embedding vectors. This results in a complete graph that captures the semantic manifold of the candidate set.</p>
        <p>To refine the initial ranking, we employ personalized PageRank (diffusion). A personalization vector p is constructed directly from the raw dense retrieval scores s of the n candidate chunks, where each component is proportional to the initial dense retrieval score of candidate i. Thus, p is neither empty nor randomly initialized: it is deterministically defined by normalizing the retrieval scores, ensuring that higher-scored chunks receive greater weight:</p>
        <p>pᵢ = sᵢ / ∑ⱼ₌₁ⁿ sⱼ, i = 1, …, n, so that p ≠ 0 and ∑ᵢ₌₁ⁿ pᵢ = 1. (1)</p>
        <p>This vector biases the random walk towards the candidates that were most relevant before the graph diffusion step is applied. The final PageRank scores π ∈ ℝⁿ are computed iteratively via the NetworkX library [19], solving the standard personalized PageRank equation π = α W π + (1 − α) p, where W is the column-stochastic transition matrix obtained by normalizing the graph's edge weights and α is the damping factor.</p>
      </sec>
      <sec id="sec-2-4">
        <title>For RAG Search, the formatted top-20 re-ranked chunks are appended under a Retrieved Context: heading</title>
        <p>{ "text": Role + conversation history, "rawQuery": User Query, "skipApiKeyValidation": false, "Retrieved Context": Top 20 chunks }</p>
      </sec>
      <sec id="sec-2-5">
        <title>For Web Search, the serialized JSON from SerpAPI is appended under a Web Search Results: heading in the JSON file.</title>
        <p>The damping factor is set to α = 0.85 [9], striking a balance between "walking" the similarity graph (propagating scores along edges) and "teleporting" back to the seed nodes (initial retrieval scores) to avoid getting stuck in tight clusters. Values much higher (&gt;0.9) can slow convergence and over-emphasize dense subgraphs; values much lower (&lt;0.7) behave more like pure retrieval without graph smoothing. This diffusion process up-ranks candidates that belong to dense, semantically coherent clusters within the graph, mitigating the risk of relying on isolated high-similarity outliers. For context formulation, the candidates are sorted by their final PageRank scores in descending order, and the top K = 20 chunks are selected to form the Retrieved Context. If the re-ranking step is disabled or fails, the system falls back to the top 20 candidates from the initial dense retrieval. The top 20 candidates are selected because, through trial and error, we found that passing 20 candidates to the LLM is sufficient to cover enough potential context that the relevant bits are not lost, while avoiding too many off-topic chunks that would dilute the diffusion signal. A 20-node graph is also small enough for sub-100 ms diffusion passes, keeping end-to-end latency low. If the collection is huge, increasing the number of top candidates passed to the LLM would be recommended.</p>
        <sec id="sec-2-5-1">
          <title>3.5. Web Search Augmentation</title>
          <p>For queries with the Web Search intent, the system
queries the Google Search engine via the SerpAPI. The
query retrieves the "answer box" and up to 20 top organic
results. The structured JSON response from the API is
serialized into a string. If the API call fails, the process
continues without web context.</p>
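          <p>A minimal sketch of the web-search call follows. The helper names are ours; engine, q, num, and api_key are standard SerpAPI request parameters for the Google engine:</p>

```python
import json

def build_serp_params(query, api_key, num=20):
    """Request parameters for SerpAPI's Google engine."""
    return {"engine": "google", "q": query, "num": num, "api_key": api_key}

def serialize_web_context(response_json):
    """Keep the answer box and up to 20 organic results as one string;
    on an API failure the caller simply proceeds without web context."""
    return json.dumps({
        "answer_box": response_json.get("answer_box"),
        "organic_results": response_json.get("organic_results", [])[:20],
    })
```
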
        </sec>
        <sec id="sec-2-5-2">
          <title>3.6. LLM Prompting and Response</title>
        </sec>
        <sec id="sec-2-5-3">
          <title>Generation</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>All prompts are submitted to the gemini-2.5-pro</title>
        <p>model [12]. The final prompt is dynamically assembled
based on the routing path: Every prompt begins with a
ifxed role definition and the current conversation history.</p>
      </sec>
      <sec id="sec-2-7">
        <title>For example the payload JSON file would look like as</title>
        <p>follows. In place of Role we would define the role of the</p>
      </sec>
      <sec id="sec-2-8">
        <title>LLM to give it a persona and in place of conversation history we would have the conversations between the User and the LLM.</title>
        <p>"text": Role + conversation history,
"rawQuery": User Query,
"skipApiKeyValidation": false
"Web Search Results": Top 20 search
results
number of top candidates to pass to the LLM would be tional context is added. The final composite prompt is
recommended.</p>
      </sec>
      <sec id="sec-2-9">
        <title>For Greeting and Conversation Meta intents, no addisent to the Gemini API. The extracted text from the response is returned to the client in a JSON object containing the reply and the original intent.</title>
      </sec>
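      <p>Assembling the composite prompt then amounts to merging the base payload with the path-specific context. This sketch uses the field names from the payload examples above; the exact structure sent to the Gemini client may differ:</p>

```python
def build_payload(role, history, query, retrieved=None, web_results=None):
    """Compose the final prompt payload for one routing path."""
    payload = {
        "text": role + "\n" + "\n".join(history),  # role definition + history
        "rawQuery": query,
        "skipApiKeyValidation": False,
    }
    if retrieved is not None:       # RAG Search: top-20 re-ranked chunks
        payload["Retrieved Context"] = retrieved
    if web_results is not None:     # Web Search: serialized SerpAPI JSON
        payload["Web Search Results"] = web_results
    return payload
```
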
    </sec>
    <sec id="sec-3">
      <title>4. Experiment</title>
      <sec id="sec-3-1">
        <title>To evaluate the eficacy of our proposed chatbot,</title>
        <p>particularly the contribution of graph-based difusion
reranking, we designed a series of experiments. Our
evaluation aims to answer three primary research
questions:
RQ1: Component Eficacy:</p>
      </sec>
      <sec id="sec-3-2">
        <title>How accurately does the intent classification module route user queries to the appropriate processing pipeline?</title>
        <p>RQ2: Retrieval Efectiveness:
graph-based difusion reranking significantly improve
the quality of retrieved documents compared to standard
dense retrieval baselines?
RQ3: End-to-End Performance: Does the enhanced
retrieval quality from our reranking module translate
into more accurate, faithful, and helpful final responses
generated by the LLM?</p>
      </sec>
      <sec id="sec-3-3">
        <title>Does the proposed</title>
      </sec>
      <sec id="sec-3-4">
        <title>This section details the experimental setup, the datasets used, the baselines for comparison, the evaluation metrics, and a thorough analysis of the results.</title>
        <sec id="sec-3-4-1">
          <title>4.1. Experimental Setup</title>
          <p>4.1.1. Dataset Construction</p>
          <p>We followed a similar procedure to generate the training dataset for the intent classifier (a hybrid dataset; some elements were taken from a publicly available dataset [26]) and its evaluation dataset. The scripts for generating the evaluation and training datasets can be found in the publicly available GitHub repository.</p>
          <p>To perform a realistic evaluation, we constructed a domain-specific question-answering dataset tailored to the Otto von Guericke University (OVGU) context in the English language; the same process can be followed for any other application domain. This reflects a practical application scenario where students frequently seek quick, reliable answers to academic queries, such as course details, procedures, or administrative processes, which are typically spread across the university website and official documents. The dataset was created as follows.</p>
          <p>We generated a retrieval-augmented question–answer (QA) dataset directly from our institutional PDF regulations and module handbooks using an end-to-end open-source pipeline. First, all PDF files were loaded via LangChain's PyPDFLoader [20] and split into overlapping text chunks (1000 characters, with 200 characters of overlap) with CharacterTextSplitter [21]. Each chunk was encoded into a FAISS vector store [22] using sentence-transformer embeddings (all-MiniLM-L6-v2) [23]. To produce questions, we initialized a local causal LLM (Llama-2-7B via Hugging Face's text-generation pipeline) wrapped by LangChain's HuggingFacePipeline [24]; however, any other embedding strategy or LLM could be used [25]. A few-shot prompt ("Given the following excerpt, generate n unique questions answerable from this content") was applied to each chunk (n = 2). Generated questions were de-duplicated in a case-insensitive manner, yielding a pool of 80 unique questions. For each question, we ran a retrieval-augmented QA chain: the FAISS retriever returned the top k = 4 most relevant chunks, and the LLM instantiated a "stuff"-type chain to produce concise answers, each appended with inline citations pointing to the source document chunk. All QA pairs were compiled into a final CSV (question, answer) named RAG_evaluation_Dataset.csv, resulting in 80 high-quality, syllabus-grounded items. Our fully local workflow relies exclusively on open-source models (sentence-transformers for embeddings; a Hugging Face model for the LLM) and FAISS for vector retrieval, ensuring reproducibility and data privacy. All hyperparameters (chunk size, overlap, k, temperature = 0.3, max_new_tokens = 512) are documented in our publicly available script. The resulting CSV file was manually verified, and typos were then introduced into the questions to add noise and simulate real-world queries. We also manually rechecked the answers against the source documents. The dataset can be found in our GitHub repository.</p>
          <p>4.1.2. Implementation Details: All experiments were conducted on a single machine equipped with a Ryzen 7 7800H CPU, an NVIDIA RTX 4060 GPU with 8 GB VRAM, and 16 GB of RAM. The system implementation uses the library versions specified in the requirements.txt file. The key hyperparameters for the RAG pipeline, including α = 0.85 for PageRank and K = 20 for the number of retrieved chunks, were kept constant across all experiments.</p>
          <p>4.2. Intent Classification Fine-Tuning: We fine-tuned RoBERTa-base for intent classification using a parameter-efficient LoRA setup. Our pipeline comprises dataset preparation, LoRA integration, training, and evaluation. A CSV dataset of user queries (Question) and intent labels (Label) was loaded, label-encoded, and split 80/20 (seed 42) in a stratified fashion. Queries were tokenized with RoBERTa's tokenizer (max length 64), producing input_ids and attention_mask fields wrapped in Hugging Face Dataset objects. We loaded roberta-base configured for four intents and applied LoRA adapters (PEFT) to the query and key projections with rank r = 8, α = 16, and dropout p = 0.05, freezing all other model weights. Fine-tuning ran on GPU (or CPU) with seed 42. We used Hugging Face's Trainer with the hyperparameters listed in Table 1 and an accuracy metric (scikit-learn). The best checkpoint (by validation accuracy) was evaluated on the held-out split. For inference, inputs are tokenized to length 64, passed through the model, and predicted indices are mapped back to label strings. This LoRA-based approach updates a small fraction of parameters, yielding fast convergence and lightweight deployment.</p>
          <p>Table 1: Fine-tuning hyperparameters for intent classification. Learning rate: 2 × 10⁻⁵; batch size: 8; weight decay: 0.01; epochs: 3; evaluation strategy: end of each epoch; checkpoint retention: last two checkpoints; selection criterion: best validation accuracy; logging frequency: every 50 steps.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>4.3. Baselines and System Variants</title>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>To answer RQ2, we evaluated the final generated re</title>
        <p>sponses. Automatic Metrics like ROUGE-L (Measures
n-gram overlap with the reference answer, focusing on
recall.), BERTScore (Computes the semantic similarity
between the generated response and the reference answer
using contextual embeddings.) have been considered.</p>
      </sec>
      <sec id="sec-3-6">
      <title>We compared our full system against several baselines</title>
      <p>and ablations to isolate the impact of our contributions.</p>
      <p>Lexical Baseline (BM25): a classic sparse retrieval system using TF-IDF with the Okapi BM25 algorithm. This represents a traditional, non-neural IR baseline. Dense Retrieval Baseline (Dense-NoRerank): this system uses the same CLIP-based query embedding and Milvus index as our proposed method but omits the graph reranking step. It simply takes the top-K results based on raw inner-product similarity. This serves as our primary ablation to directly measure the impact of diffusion reranking. Proposed System (Dense-Rerank): our full RAG pipeline as described in Section 3, which includes initial dense retrieval followed by graph-based diffusion reranking. For end-to-end evaluation, the retrieved context from each of these three systems is fed into the same Gemini-2.5-pro model [12] using an identical prompt structure.</p>
      <p>4.5. Results and Analysis</p>
      <p>4.5.1. Intent Classification Performance (RQ1)</p>
      </sec>
      <sec id="sec-3-7">
        <title>We finetuned the model for three full epochs using a</title>
        <p>linear learning-rate schedule from 2 × 10⁻⁵ down to 0. Figures 2–5 summarize key training diagnostics. The details of the model fine-tuning are given in Section 4.2.</p>
        <p>(Figures: Training Loss vs. Epoch; Evaluation Accuracy vs. Epoch.)</p>
        <p>4.4.2. Retrieval Performance</p>
      </sec>
      <sec id="sec-3-8">
      <title>To answer RQ2, we evaluated the quality of the ranked list of documents returned by each retrieval system against the annotated ground-truth chunks.</title>
        <p>Normalized Discounted Cumulative Gain
(nDCG@K): Measures the quality of the ranking,
rewarding systems that place highly relevant documents
at the top of the list. We reported nDCG@5, nDCG@10,
and nDCG@20.</p>
        <p>Mean Reciprocal Rank (MRR): Measures the average
reciprocal rank of the first relevant document. It is
particularly sensitive to how high the very first correct
answer is ranked.</p>
        <p>Recall@K: The proportion of relevant documents found in the top K results.</p>
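        <p>These are standard ranking metrics; for reference, minimal implementations over per-query binary relevance lists (the function names are ours) look like:</p>

```python
import numpy as np

def dcg_at_k(rels, k):
    """Discounted cumulative gain of a ranked relevance list, cut at k."""
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum(rels / np.log2(np.arange(2, rels.size + 2))))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_runs):
    """Mean reciprocal rank of the first relevant item over all queries."""
    recip = []
    for rels in ranked_runs:
        hits = [i for i, r in enumerate(rels) if r > 0]
        recip.append(1.0 / (hits[0] + 1) if hits else 0.0)
    return float(np.mean(recip))

def recall_at_k(rels, total_relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for r in rels[:k] if r > 0) / total_relevant
```
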
        <sec id="sec-3-8-1">
          <title>4.4. Evaluation Metrics</title>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <title>We employed automatic evaluation metrics to assess performance at different stages of the pipeline.</title>
        <p>4.4.1. Intent Classification: We evaluated the LoRA-tuned RoBERTa classifier using standard metrics on a held-out test set from our annotated dataset. Accuracy: the overall percentage of correctly classified intents. Macro-F1 Score: the unweighted mean of the F1-scores for each of the four intent classes, providing a balanced measure of performance.</p>
        <p>Figure 2: Training loss as a function of epoch. Loss fell precipitously from 1.38 to 0.05 within the first half-epoch and then decayed asymptotically toward zero by epoch 3.</p>
        <p>(Figure 4: Learning Rate vs. Epoch.)</p>
        <p>Each classification class, such as RAG Search, Web Search, Greeting, and Conversation_Meta, has 4K rows. This synthetic dataset is used to evaluate the model right after fine-tuning. A custom-made external dataset of real-world queries, constructed with the procedure described in 4.1.1, is used to compute a confusion matrix in addition to the one generated from the synthetic dataset. The external dataset contains the kinds of typos generally seen in real-world usage.</p>
        <p>Table 2 reports Accuracy and Macro-F1 on the held-out portion of our annotated dataset. Tables 3 and 4 show the corresponding 4×4 confusion matrices (true ∖ predicted).</p>
      </sec>
      <sec id="sec-3-10">
        <title>After fine-tuning the model, we tested it in two ways, using the previously discussed synthetic dataset, which has 16K rows, where each classification</title>
      </sec>
      <sec id="sec-3-11">
        <title>On the annotated dataset, we achieve 99.88% Accuracy</title>
        <p>and Macro-F1, with only four misclassifications (all in the "greeting" intent). On the external test dataset, we observe perfect scores with no off-diagonal errors. These results indicate that our LoRA-tuned RoBERTa model is highly reliable for routing user utterances to their correct intents under both in-domain and held-out conditions.</p>
        <p>The rapid decline in training loss (Fig. 2) demonstrates that the model quickly learns low-level patterns. Evaluation accuracy (Fig. 3) increases monotonically from 99.875% to 99.96875%, while evaluation loss falls from 0.00416 to 0.00123, indicating continued but diminishing generalization improvements across epochs. The learning-rate schedule (Fig. 4) balances coarse early updates and fine-tuning in later epochs, and the gradient norms (Fig. 5) confirm that the optimizer transitions smoothly from high-magnitude updates to stable, small magnitudes without oscillation or divergence. Overall, these results validate our choice of schedule and training regime, showing strong convergence with minimal overfitting.</p>
        <p>4.5.2. Retrieval Effectiveness (RQ2)</p>
      </sec>
      <sec id="sec-3-12">
        <title>4.5.2. Retrieval Effectiveness (RQ2)</title>
        <p>We tested the retriever with the custom dataset
discussed earlier in 4.1.1.</p>
        <p>The diffusion-based reranking step yields a
substantial lift over plain dense retrieval. Early-rank gains:
nDCG@5 increases from 0.82 to 0.90 (+9.8%), and MRR
from 0.88 to 0.95 (+8.0%), showing that the first
relevant chunk is more consistently ranked at the very top.</p>
        <p>Broader coverage: Recall@5 improves from 0.90 to 0.94,
indicating that nearly all of the relevant passages are
captured within the top 5 results.</p>
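        <p>For reference, the three retrieval metrics reported here (MRR, Recall@5, and binary-relevance nDCG@5) can be sketched as follows; the function names and the toy ranking are illustrative, not the paper's evaluation harness:</p>
        <preformat>
```python
import math

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant chunk per query."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        for i, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / i
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k=5):
    """Average fraction of each query's relevant chunks in the top k."""
    return sum(len(set(r[:k]) & rel) / len(rel)
               for r, rel in zip(ranked_lists, relevant_sets)) / len(ranked_lists)

def ndcg_at_k(ranked_lists, relevant_sets, k=5):
    """Binary-relevance nDCG@k averaged over queries."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_sets):
        dcg = sum(1.0 / math.log2(i + 1)
                  for i, doc in enumerate(ranked[:k], start=1) if doc in rel)
        idcg = sum(1.0 / math.log2(i + 1)
                   for i in range(1, min(len(rel), k) + 1))
        total += dcg / idcg
    return total / len(ranked_lists)

# One query: the two relevant chunks appear at ranks 2 and 4.
ranked = [["c3", "c1", "c7", "c2", "c9"]]
relevant = [{"c1", "c2"}]
```
        </preformat>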
        <p>Table: Retrieval metrics (MRR) for BM25, Dense Retrieval
(no rerank), and + Diffusion Re-Ranking.</p>
        <p>This improvement stems from the Personalized PageRank
diffusion over the dense-embedding graph. Cluster
promotion: semantically coherent clusters of chunks
mutually reinforce each other, raising their rank. Noise
suppression: isolated or tangential hits receive little diffusion
signal and therefore drop down the list. As a result,
Dense-Rerank not only boosts the presence of highly relevant
documents at top positions (driving up nDCG and
MRR) but also enhances overall recall within the critical
early ranks.</p>
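        <p>A minimal sketch of this reranking step, assuming cosine-similarity edges, a teleport vector proportional to query-chunk similarity, and the α = 0.85 damping used in the paper (all function and variable names here are illustrative, not the paper's actual implementation):</p>
        <preformat>
```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def diffusion_rerank(query_emb, chunk_embs, alpha=0.85, iters=50):
    """Rerank candidates by personalized PageRank over a
    cosine-similarity graph; teleport mass follows query similarity."""
    n = len(chunk_embs)
    # Edge weights: pairwise similarity (negatives clipped), no self-loops.
    W = [[max(cosine(chunk_embs[i], chunk_embs[j]), 0.0) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # Row-normalize into a transition matrix.
    P = []
    for row in W:
        s = sum(row)
        P.append([w / s for w in row] if s else [1.0 / n] * n)
    # Personalization vector from query-chunk similarity.
    q = [max(cosine(query_emb, e), 0.0) for e in chunk_embs]
    qs = sum(q) or 1.0
    q = [x / qs for x in q]
    # Power iteration: r <- (1 - alpha) * q + alpha * (r @ P)
    r = q[:]
    for _ in range(iters):
        r = [(1 - alpha) * q[j] + alpha * sum(r[i] * P[i][j] for i in range(n))
             for j in range(n)]
    return sorted(range(n), key=lambda i: r[i], reverse=True)

# Toy example: three mutually similar chunks near the query plus one
# tangential chunk; the isolated chunk receives little diffusion mass.
query = [1.0, 0.0]
chunks = [[1.0, 0.05], [0.95, 0.1], [0.9, 0.02], [0.0, 1.0]]
order = diffusion_rerank(query, chunks)
```
        </preformat>
        <p>Here the cluster promotion and noise suppression described above fall out of the same power iteration: the coherent chunks exchange diffusion mass while the orthogonal chunk is pushed to the bottom of the ranking.</p>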
      </sec>
      <sec id="sec-3-13">
        <title>4.5.3. End-to-End Generation Quality (RQ3)</title>
        <p>The improvements in automatic metrics mirror our
retrieval findings (RQ2). Faithfulness and helpfulness:
higher ROUGE-L and BERTScore for Dense-Rerank
indicate more accurate and relevant content generation,
thanks to the superior top-K retrieval. Retrieval →
generation link: RQ2 showed that diffusion reranking
promotes centrally relevant chunks; RQ3 demonstrates
that feeding those higher-quality chunks into the LLM
yields outputs that better match reference texts (ROUGE-L)
and show higher semantic overlap (BERTScore). Fluency:
we observed similar fluency across all three systems (not
shown), as fluency is primarily governed by the pre-trained
LLM rather than the retrieval backend. Thus, the
end-to-end generation-quality gains can be directly
attributed to the gains in retrieval effectiveness.</p>
        <table-wrap id="tab6">
          <label>Table 6</label>
          <caption><p>Automatic generation-quality metrics for end-to-end RAG systems.</p></caption>
          <table>
            <thead><tr><th>Method</th><th>ROUGE-L</th><th>BERTScore</th></tr></thead>
            <tbody>
              <tr><td>BM25</td><td>0.46</td><td>0.68</td></tr>
              <tr><td>Dense Retrieval (no rerank)</td><td>0.74</td><td>0.79</td></tr>
              <tr><td>+ Diffusion Re-Ranking</td><td>0.84</td><td>0.86</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-3-13-1">
          <title>5. Conclusion</title>
          <p>We presented Diffusion-Aided RAG, a novel pipeline
that couples dense retrieval with graph-based diffusion
reranking to improve the precision and contextual
coherence of Retrieval-Augmented Generation systems. By
constructing a semantic similarity graph over the top-candidate
chunks and applying a personalized PageRank diffusion,
our method consistently boosts early-rank retrieval
metrics (nDCG@5, MRR) and broad recall (R@20),
translating directly into higher ROUGE-L and BERTScore
on end-to-end QA generation. The framework is efficient
enough for real-time applications, relies on open-source
components (Milvus, CLIP, Gemini), and demonstrates
robustness across both synthetic and external query sets.</p>
        </sec>
        <sec id="sec-3-13-2">
          <title>5.1. Limitations</title>
          <p>The current Diffusion-Aided RAG framework, while
demonstrating significant improvements in retrieval
effectiveness and generation quality, exhibits several
critical limitations that warrant careful consideration for
broader deployment and cross-linguistic applications.
The most pronounced limitation concerns hyperparameter
sensitivity, particularly regarding the damping factor
α = 0.85 employed in the personalized PageRank diffusion
process. This parameter, borrowed from the canonical
PageRank algorithm, was empirically validated on the
OVGU academic dataset but may exhibit suboptimal
performance across different domains or linguistic
contexts. The choice of K = 20 candidate chunks for final
context formation, while computationally efficient for
maintaining sub-100ms response times, represents
another domain-specific optimization that lacks theoretical
grounding for universal applicability.</p>
          <p>The system’s architectural dependencies introduce
additional constraints that become particularly
problematic when considering cross-linguistic adaptation.
The reliance on the openai/clip-vit-base-patch32
embedding model, which produces 512-dimensional vectors
optimized primarily for English text, creates a
fundamental bottleneck for multilingual applications. This
model’s training corpus exhibited limited exposure to
non-English languages, potentially compromising
semantic representation quality for languages with different
morphological complexity, syntactic structures, or
cultural contexts. The IVF_FLAT index configuration with
nlist=128 in the Milvus-Lite vector store, while adequate
for the current academic dataset, may require significant
recalibration for larger or more diverse document
collections. The intent classification module, despite achieving</p>
        </sec>
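        <p>For concreteness, the IVF_FLAT/nlist=128 configuration mentioned above can be written as a pymilvus-style parameter fragment; the metric type and the nprobe value are our assumptions for illustration, not taken from the paper:</p>
        <preformat>
```python
# Sketch of the IVF_FLAT index parameters described in the text.
# nlist=128 partitions the 512-dimensional CLIP vectors into 128
# coarse clusters; at query time only `nprobe` clusters are scanned.
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "COSINE",   # assumption: cosine over CLIP embeddings
    "params": {"nlist": 128},
}
search_params = {
    "metric_type": "COSINE",
    "params": {"nprobe": 16},  # illustrative value, not from the paper
}

# Against a live Milvus collection this would be applied roughly as:
# collection.create_index(field_name="embedding", index_params=index_params)
```
        </preformat>
        <p>Recalibrating for larger collections would mainly mean raising nlist (more, smaller clusters) and tuning nprobe to trade recall against latency.</p>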
        <sec id="sec-3-14">
          <title>6. Acknowledgments</title>
          <p>This research is partially funded by the PNRR project FAIR -
Future AI Research (PE00000013), Spoke 6 - Symbiotic AI.</p>
        </sec>
        <sec id="sec-3-15">
          <title>A. Online Resources</title>
          <p>The source code for the overall implementation of our
project can be accessed through our GitHub repository.</p>
          <p>• GitHub</p>
        </sec>
        <sec id="sec-3-16">
          <title>Declaration on Generative AI</title>
          <p>During the preparation of this work, the author(s) used
ChatGPT (OpenAI), Gemini (Google), and Grammarly in order to:
paraphrase and reword, improve writing style, and check grammar
and spelling. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>