<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalable Multimodal RAG Systems: Integrating AI for Adaptive Information Retrieval and Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denys Yuvzhenko</string-name>
          <email>d.yuvzhenko@kpi.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktor Putrenko</string-name>
          <email>putrenko10@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergii Lupenko</string-name>
          <email>serhii.lupenko@auk.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nataliia Pashynska</string-name>
          <email>nat.pashynska@gmail.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>American University Kyiv</institution>
          ,
          <addr-line>Poshtova Square, 3, 04070, Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Igor Sikorsky KPI</institution>
          ,
          <addr-line>37, Prospect Beresteiskyi, 03056, Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Opole University of Technology</institution>
          ,
          <addr-line>Mikołajczyka Street 16, 45-271 Opole</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>Volodymyrska Street, 60, 01033, Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This study presents an asynchronous, web-based Retrieval-Augmented Generation (RAG) system that integrates multimodal inputs (text, images, tables) to enhance information retrieval and generation in various contexts. The system is developed in Python and hosted on AWS, combining Chroma DB for vector storage with Anthropic's Claude-3-Haiku model, accessed via AWS Bedrock. By leveraging modern cloud capabilities, the solution scales efficiently and handles diverse data modalities in real time. Through systematic experiments, this project highlights the effectiveness of multimodal embedding techniques for refining retrieval accuracy and providing context-aware responses. The architecture's modular design supports seamless feature integration, making it adaptable for different use cases such as customer support, educational tools, and content creation. Key findings emphasize the role of vector databases in dynamic information updates and confirm that large language models, when appropriately curated and grounded, can produce high-quality, relevant outputs.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative artificial intelligence</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>information systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Today’s AI-driven solutions are rapidly reshaping how we find, process, and interpret information
across diverse fields – ranging from business intelligence and customer support to research and
beyond [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Key to these advancements is the emergence of powerful techniques like
Retrieval-Augmented Generation (RAG), which seamlessly merges real-time data retrieval with generative
capabilities to produce context-rich outputs.
      </p>
      <p>
        Central to this project is the design and implementation of a multimodal RAG [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] system. By
leveraging a vector database for efficient data storage and retrieval, coupled with an advanced LLM
for generation, the system can handle various data formats (text, images, tables) asynchronously.
This approach offers scalable, on-demand insights applicable to everything from market analysis and
automated content creation to domain-specific research.
      </p>
      <p>Beyond its technical merits, the system highlights important considerations around security, data
privacy, and responsible AI practices. These factors are becoming increasingly critical as
organizations integrate AI solutions into their workflows, whether to streamline customer service,
refine decision-making processes, or create adaptive learning tools.</p>
      <p>Overall, this diploma project underscores the transformative potential of integrating modern AI
architectures into real-world operations. By showcasing flexible data handling, robust retrieval, and
domain-agnostic generation, it offers a blueprint for harnessing AI’s capabilities across industries –
paving the way for more responsive, intelligent applications that drive innovation and efficiency at
scale.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview of Retrieval-Augmented Generation (RAG)</title>
      <p>
        Retrieval-Augmented Generation (RAG) is an approach that combines a knowledge retrieval
mechanism with a generative model to produce contextually accurate and up-to-date outputs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Instead of relying solely on a pre-trained model’s internal parameters, RAG pulls relevant data from
external sources – often stored in vector databases – before generating a response. This setup helps
deliver more precise, domain-focused information for tasks as varied as customer support, data
analysis, market research, and content generation.
      </p>
      <p>
        In practice, RAG typically operates in two main phases: retrieval and generation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The retrieval
component searches through indexed material (e.g., documents, FAQs, or multimodal data stored in a
vector database like Chroma DB) to identify the best matching context. The generation component
then uses that context to craft a response with greater relevance than a standalone Large Language
Model (LLM) might achieve. This architecture makes RAG systems highly adaptable, enabling them
to serve unique needs across various industries – such as real-time market trend analysis, complex
report generation, or specialized research tasks.
      </p>
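      <p>As a minimal illustration of this two-phase flow, the Python sketch below assumes three placeholder callables – an embedding function, a Chroma-style vector collection, and an LLM client – rather than any particular deployment:</p>
      <preformat>
# Conceptual sketch of the two phases; `embed`, `collection`, and `generate`
# are placeholders for an embedding model, a Chroma-style vector collection,
# and an LLM client respectively.

def answer_query(query: str, embed, collection, generate, top_k: int = 5) -> str:
    # Retrieval phase: embed the query and fetch the best-matching chunks.
    query_vector = embed(query)
    hits = collection.query(query_embeddings=[query_vector], n_results=top_k)
    context_chunks = hits["documents"][0]

    # Generation phase: ground the model in the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(context_chunks)
        + "\n\nQuestion: " + query
    )
    return generate(prompt)
      </preformat>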
      <p>By integrating retrieval and generation in a unified pipeline, RAG solutions can provide
responsive, high-fidelity answers while reducing hallucinations and inaccuracies. As companies and
organizations continue to explore AI-driven workflows, RAG offers a powerful framework for
building applications that harness large language models – like Claude-3-Haiku – while maintaining
a sharp focus on the most relevant data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Multimodal Approaches in Modern AI Systems</title>
      <p>
        Multimodal AI systems integrate various data types – such as text, images, audio, or even sensor
readings – to deliver richer, more context-aware responses [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Instead of processing a single input
channel, these systems correlate multiple sources of information to improve accuracy, enhance user
experience, and unlock use cases spanning industry, research, and consumer applications.
      </p>
      <p>By examining details from different modalities, AI models can capture nuances that a text-only or
image-only system might miss. For example, a support bot might leverage text transcripts alongside
product images to better understand and troubleshoot user queries, while a market analysis tool
could fuse social media sentiment (text) with satellite imagery of store traffic to gain a competitive
edge. In research contexts, combining scientific papers (text) with lab instrument readings (numeric
data) or microscopy images (visual data) can accelerate discoveries and help validate hypotheses
more efficiently.</p>
      <p>Multimodal setups often involve specialized pipelines for each data format. After preprocessing,
features are merged through a unified embedding space or a transformer-based architecture that can
handle multiple input types concurrently. This enables systems to dynamically adapt to diverse data
sources, delivering responses that closely align with real-world scenarios. As the field evolves, the
integration of multimodal techniques with Retrieval-Augmented Generation (RAG) further refines
content accuracy and extends application reach – ensuring that high-quality, tailored information is
available to users across a wide array of domains and platforms.</p>
      <p>A crucial aspect of building these systems is storing and managing multimodal data in vector
databases. In practice, each data type (e.g., text, image, audio) is transformed into a high-dimensional
numeric embedding before being indexed. A database such as Chroma DB then uses similarity search
methods (often approximate nearest neighbor techniques) to quickly retrieve the most relevant
embeddings. For instance, an e-commerce platform might store both product descriptions (text) and
images (visual data) as separate but interlinked embeddings, allowing a combined search approach
that yields more precise results when generating recommendations.</p>
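      <p>A conceptual sketch of this interlinked storage is shown below; the collection name is arbitrary, and embed_text and embed_image stand in for whatever multimodal encoder (for example, a CLIP-style model) is used:</p>
      <preformat>
import chromadb

# Illustrative only: text and image embeddings are stored as separate but
# interlinked records that share the same product_id in their metadata.
client = chromadb.PersistentClient(path="chroma_store")
catalog = client.get_or_create_collection(name="catalog")

def index_product(product_id: str, description: str, image_bytes: bytes,
                  embed_text, embed_image) -> None:
    catalog.add(
        ids=[f"{product_id}:text", f"{product_id}:image"],
        embeddings=[embed_text(description), embed_image(image_bytes)],
        documents=[description, f"image for product {product_id}"],
        metadatas=[
            {"product_id": product_id, "type": "text"},
            {"product_id": product_id, "type": "image"},
        ],
    )

# A textual query can then retrieve either modality for the same product:
# catalog.query(query_embeddings=[embed_text("trail-running shoes")], n_results=5)
      </preformat>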
      <p>Once the relevant embeddings are identified, the Retrieval-Augmented Generation (RAG) pipeline
fuses them into a unified context for a generative model. The model, aware of these multiple
modalities, can produce outputs that account for textual cues, visual patterns, or even audio signals.
This leads to more holistic and accurate responses, whether the system is addressing complex
customer queries, supporting data-driven decision-making, or facilitating advanced research tasks.
By bridging different data types under one search and generation framework, multimodal AI systems
enable truly comprehensive insights and solutions in real time.</p>
    </sec>
    <sec id="sec-4">
      <title>4. State of the Art in RAG</title>
      <p>This area of research is very new; intensive work in it spans only a few years. Therefore, despite
the use of the latest technologies and the most modern approaches available at the time of writing,
there is a non-zero probability that the rapid pace of development will have made parts of this work
outdated by the time it is read.</p>
      <p>
        Retrieval-Augmented Generation (RAG) represents a cutting-edge paradigm in natural language
processing (NLP) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and information retrieval. It enhances traditional generative models by
integrating an external knowledge source – usually stored in a vector database – to provide
up-to-date, context-specific answers. This approach not only mitigates the problem of “model
hallucination” (where the model fabricates facts) but also allows highly accurate, domain-specific
responses [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>4.1. Early and Foundational Work</title>
        <p>Open-Domain Question Answering (QA): Early RAG-like techniques emerged in QA tasks, where a
language model would consult an indexed set of documents to find relevant passages before
generating an answer. Systems such as DrQA (Facebook AI Research) and REALM (Google)
experimented with retrieving textual chunks to serve as context for generation.</p>
        <p>
          Dense Passage Retrieval [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] (DPR): This approach leverages dual encoders – one for the query and
one for the documents – to create vector embeddings that can be compared efficiently. DPR was
instrumental in showing how retrieval significantly improves end-to-end QA performance.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Modern RAG Architectures</title>
        <p>Contemporary RAG systems rely heavily on transformer-based architectures – like GPT or Claude –
combined with advanced retrieval mechanisms. Below are some widely adopted designs
and best practices:</p>
        <p>Retriever-Generator Pipeline. This is one of the most commonly implemented workflows, where
the system is split into two distinct modules – a retriever and a generator.</p>
        <p>Retriever: Focuses on scanning a vector database (such as Chroma DB, Milvus, or Pinecone) to
find semantically relevant embeddings. Rather than just text, these embeddings may represent a wide
array of data modalities, including paragraphs from PDF documents, images extracted from product
catalogs, or even short audio transcriptions. The retriever’s primary responsibility is to convert the
user query into a vector and efficiently retrieve the top-k matches via Approximate Nearest Neighbor
(ANN) search.</p>
        <p>
          Generator: Once the retriever returns the most relevant items, the generator (often a
transformer-based Large Language Model like GPT-4 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], Claude [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], or a Llama-derivative) synthesizes a
coherent answer that references the retrieved context. This architectural split ensures the LLM has
direct, up-to-date knowledge from the vector store, minimizing the chance of hallucinations and
maximizing factual relevance [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>Iterative RAG (Re-Ranking or Rerun Retrieval). Iterative RAG pipelines build on the basic
retriever-generator model but introduce additional steps to refine or re-rank results.</p>
        <p>Multiple Iterations: In this design, the system retrieves an initial batch of top-k documents or
embeddings, produces a partial (or draft) response, and then uses that partial response to refine the
query. This might involve clarifying ambiguous terms, adding extra keywords, or removing less
relevant context. The retriever is invoked again, potentially returning a different set of more precise
results that better align with the refined query.</p>
        <p>Re-Ranking: Another technique is to apply a second module – often a cross-encoder – for
re-ranking. The cross-encoder compares each candidate document with the query more thoroughly,
improving the precision of the retrieved set. Such re-ranking is especially beneficial when tasks
demand high accuracy, such as medical or legal queries where small factual mistakes can be
detrimental.</p>
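        <p>The sketch below illustrates such a re-ranking stage with the sentence-transformers CrossEncoder class; the checkpoint name is only an example and would be chosen to fit the domain:</p>
        <preformat>
from sentence_transformers import CrossEncoder

# Hypothetical second-stage re-ranker: each (query, candidate) pair is scored
# jointly, which is slower than bi-encoder retrieval but considerably more precise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example checkpoint

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
        </preformat>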
        <p>Hybrid Indexing. While many RAG systems rely purely on dense embeddings for semantic
similarity, some adopt a “hybrid” approach that blends sparse (keyword-based) indexing – like BM25
– with dense (vector-based) retrieval.</p>
        <p>Sparse + Dense: Keyword-based search is adept at matching exact terminology or acronyms,
while vector-based retrieval excels at capturing semantic relationships and synonyms. A hybrid
index can yield strong coverage for both literal matches and conceptual or paraphrased queries. Users
who rely on specific domain terms (such as medical codes, legal phrases, or brand names) benefit
from the sparse component, while the dense vectors handle broader, more context-oriented matching.</p>
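        <p>One simple way to combine the two signals is to normalize each score list and blend them with a tunable weight, as in the sketch below; the min-max normalization and the weight alpha are illustrative assumptions rather than a prescribed method:</p>
        <preformat>
# Sketch of hybrid scoring: blend a sparse (keyword) score with a dense
# (vector) similarity per document id, then rank by the blended score.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in scores.items()}

def hybrid_rank(sparse: dict[str, float], dense: dict[str, float],
                alpha: float = 0.5) -> list[str]:
    sparse_n, dense_n = normalize(sparse), normalize(dense)
    doc_ids = set(sparse_n) | set(dense_n)
    blended = {d: alpha * sparse_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
               for d in doc_ids}
    return sorted(blended, key=blended.get, reverse=True)
        </preformat>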
        <p>In summary, modern RAG solutions are increasingly sophisticated, incorporating multiple layers
of retrieval refinement and indexing strategies to ensure precise, context-driven generation. The
Retriever-Generator Pipeline forms the core of most systems, but techniques such as iterative
retrieval (with re-ranking) and hybrid indexing open the door to higher accuracy, domain specificity,
and improved tolerance of ambiguous queries. These architectures highlight how pairing advanced
transformer-based language models with well-structured retrieval mechanisms can deliver highly
relevant, up-to-date responses in diverse environments – from e-commerce recommendation engines
to deeply specialized medical knowledge bases.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Advanced Techniques for Multimodal RAG</title>
        <p>Although this direction is among the newest, the following two techniques stand out for
achieving the most relevant and accurate answers over large databases with mixed content types:</p>
        <p>Multimodal Embeddings: Instead of limiting the vector database to text embeddings, advanced
RAG systems store embeddings for images, videos, or audio. Approaches such as CLIP (by OpenAI) or
BLIP (Salesforce Research) enable semantically meaningful vector representations of images, which
can then be retrieved similarly to text.</p>
        <p>Cross-Modal Retrieval: Allows queries in one modality (e.g., text) to retrieve related content in
another modality (e.g., images). This opens up a variety of use cases like product search, visual QA,
and more.</p>
        <p>RAG has reshaped how we view language models, demonstrating that external context retrieval
can significantly boost factual accuracy and domain specialization. The move toward multimodality,
iterative refinement, and dynamic knowledge updates promises even more robust and versatile RAG
systems. As LLMs continue to scale and vector databases mature, RAG will likely remain the leading
paradigm for applications demanding both creative generation and factual grounding.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Large Language Models (LLMs) and Their Role in RAG</title>
      <p>Large Language Models (LLMs) form the backbone of Retrieval-Augmented Generation (RAG) by
turning retrieved context into coherent and contextually accurate answers. These models, trained on
massive text corpora, can generate human-like language, summarize complex information, or even
infer connections across topics. When combined with real-time retrieval from external data sources,
LLMs are significantly more powerful and reliable than when operating solely on their internal
parameters.</p>
      <p>Key Characteristics of LLMs in RAG:</p>
      <p>
        Pre-training on Vast Corpora. LLMs such as GPT-4 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or Anthropic Claude [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] undergo
pre-training on billions of tokens collected from diverse sources, including books, websites, and
user-generated content. This stage equips the model with a broad understanding of language structure and
semantic relationships. By absorbing patterns from this extensive corpus, LLMs develop a
generalized capability to respond to prompts on various topics.
      </p>
      <p>This extensive linguistic and factual grounding means that LLMs can be paired with different
types of domain-specific data during the retrieval phase, allowing them to handle a wide range of
queries – from technical troubleshooting to in-depth market analysis.</p>
      <p>When new context is presented via retrieval, the model leverages its base knowledge while
integrating fresh, external content to respond with increased precision.</p>
      <p>Because the model already “knows” a great deal about language and real-world concepts,
augmenting it with real-time or specialized data can yield remarkably accurate results without
having to train a custom model from scratch.</p>
      <p>Context Window Constraints. Despite their extensive pre-training, LLMs maintain a finite
context window, often measured in tokens, that limits how much text they can “see” or process in a
single pass. For instance, a model may only accept a few thousand tokens of input before it cannot
ingest further material.</p>
      <p>Retrieving only the most pertinent data from a vector database is essential to ensure the final
prompt stays within the model’s context window. By filtering out extraneous or repetitive material,
RAG systems can present a curated set of chunks or passages that fit into this constrained input
space.</p>
      <p>This mechanism helps avoid cutting off essential details mid-prompt, which could confuse the
model or compromise the fidelity of the final answer.</p>
      <p>Architects of RAG solutions must plan for robust chunking and summarization strategies, as well
as top-k or top-n retrieval logic, to keep the input text concise yet maximally informative.</p>
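      <p>As a rough sketch of this selection logic, relevance-ranked chunks can simply be packed until a token budget is exhausted; the count_tokens callable is a placeholder for whichever tokenizer matches the target model:</p>
      <preformat>
# Minimal sketch of keeping a prompt inside the model's context window.
# `chunks` are assumed to arrive already ranked by relevance.

def pack_context(chunks: list[str], budget: int, count_tokens) -> list[str]:
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # adding this chunk would overflow the context window
        selected.append(chunk)
        used += cost
    return selected
      </preformat>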
      <p>Enhanced Accuracy and Reduced Hallucinations. LLMs, by their nature, can sometimes produce
fabricated or off-topic statements – commonly referred to as “hallucinations.” Such occurrences arise
when the model draws upon probabilistic patterns rather than referencing verified facts.</p>
      <p>Providing curated data from a vector store effectively grounds the model’s generation process.
Because the LLM is referencing real, contextually relevant text, it becomes significantly less prone to
inventing details that are not reflected in the source material.</p>
      <p>In knowledge-intensive domains (e.g., medical or legal fields), limiting the scope of possible
responses to high-quality, domain-vetted information can drastically reduce factual errors.</p>
      <p>This grounding not only bolsters correctness but also builds user trust, as the system can reference
and cite specific passages or metadata that informed its final answer.</p>
      <p>Flexibility and Domain Adaptation. LLMs offer a high degree of adaptability when supplemented
with domain-specific training or fine-tuning. By ingesting corpora in specialized fields – like finance,
law, or healthcare – an LLM can refine its approach to technical language, conventions, and idiomatic
usage within that domain.</p>
      <p>Developers can use incremental fine-tuning strategies on LLMs for specialized tasks, retaining the
general benefits of the model’s broad pre-training while honing its responses in narrower areas of
expertise.</p>
      <p>Reinforcement Learning from Human Feedback (RLHF) provides an additional layer of alignment,
ensuring that generated content adheres to professional standards and expert-reviewed guidelines.</p>
      <p>When integrated with retrieval, a model that has already been partially fine-tuned on relevant
topics will better synthesize newly retrieved context, achieving more coherent, domain-specific
outputs.</p>
      <p>Workflow in a RAG System:</p>
      <p>User Query: The user provides text prompts or multimodal data (e.g., images).</p>
      <p>Retrieval Stage: A vector database (e.g., Chroma DB) identifies top-k embeddings that match
the user’s query.</p>
      <p>Context Injection: These embeddings, typically chunks of text or encoded visuals, are passed
into the LLM’s context window.</p>
      <p>Generation: The LLM synthesizes a response that references the retrieved context, aiming for
concise, accurate, and up-to-date information.</p>
      <p>By grounding LLMs in context retrieved from vector databases, RAG frameworks harness the best
of both worlds: broad language fluency from massive pre-training and precise, domain-specific
knowledge from external data sources. This synergy addresses critical limitations of standalone
generative models and opens the door to reliable, versatile applications across business, research, and
beyond.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Multimodal Embeddings</title>
      <p>Multimodal embeddings extend the concept of text-only vector representations to include a variety of
data formats – images, audio, video, and even sensor data. By converting different data types into a
shared vector space, AI systems can perform cross-modal retrieval and reasoning, creating new
possibilities in applications such as visual search, audio-based information retrieval, and video
summarization.</p>
      <p>
        Models such as CLIP, Florence-2, and the lightweight Phi-3-Vision [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] demonstrate how
contrastive pre-training on paired image–text datasets can align visual and linguistic concepts so
well that a query expressed in prose can reliably retrieve a relevant image, and vice versa [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Creating such a unified space requires dedicated encoders for each modality. Text is typically
embedded with transformer-based language models or sentence-BERT variants[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] that preserve
semantic similarity over long sequences. Images are mapped with convolutional backbones or vision
transformers that convert pixel grids into high-level feature vectors. Audio, which carries
information in both time and frequency domains, is often represented through spectrogram-based
CNNs or latent representations learned by wav2vec-style architectures. Once produced, the vectors
are normalised and projected to a common dimensionality so that cosine distance comparisons
become meaningful across modalities. Careful calibration—often enforced during training by shared
projection layers or temperature-scaled contrastive losses—ensures that a photograph of a violin and
the phrase “classical string instrument” occupy neighbouring regions of the embedding manifold.
      </p>
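      <p>In code, such a comparison reduces to a dot product between L2-normalised vectors, as the short sketch below illustrates; the encoders that produce the vectors are assumed to exist elsewhere:</p>
      <preformat>
import numpy as np

# Sketch only: once both encoders emit vectors of the same dimensionality,
# an L2 normalisation step makes cosine similarity a plain dot product.

def normalise(vec: np.ndarray) -> np.ndarray:
    return vec / (np.linalg.norm(vec) + 1e-12)

def cross_modal_similarity(text_vec: np.ndarray, image_vec: np.ndarray) -> float:
    return float(np.dot(normalise(text_vec), normalise(image_vec)))
      </preformat>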
      <p>The most successful training recipes rely on contrastive learning, where paired examples are
pulled together in latent space while mismatched pairs are pushed apart. Large-scale resources such
as the LAION-2B image–text corpus or AudioSet’s labelled clips provide billions of positive–negative
pairs that drive this alignment. When explicit pairs are unavailable, self-supervised objectives,
including masked prediction or multimodal autoencoding, can still coax encoders toward a common
representation by forcing them to reconstruct missing content through complementary signals.</p>
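      <p>As a minimal illustration of the temperature-scaled contrastive objective mentioned above, the sketch below computes an InfoNCE-style loss over a batch of paired embeddings; the batch layout and temperature value are assumptions for illustration:</p>
      <preformat>
import numpy as np

# Rows of `a` and `b` are matching pairs (e.g. image and caption embeddings);
# every other row in the batch serves as a negative example.
def contrastive_loss(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # pairwise similarities
    targets = np.arange(len(a))                    # matching pair on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[targets, targets].mean())
      </preformat>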
      <p>After encoding, all vectors—whether they describe paragraphs, PNGs or waveforms—are written
to a high-dimensional index such as Chroma DB. This unified store allows a Retrieval-Augmented
Generation pipeline to execute a single approximate-nearest-neighbour search, retrieve the top-k
context chunks and inject them into a large language model prompt. In a retail catalogue, for
example, embeddings of product descriptions, user reviews and studio photographs coexist; a
customer who types “lightweight trail-running shoes with aggressive tread” can be served the most
relevant images even if the precise wording never appears in the metadata, while an uploaded shoe
photo can surface corresponding textual reviews.</p>
      <p>By collapsing heterogeneous data into one latent space, multimodal embeddings grant RAG
systems the ability to answer richer, genuinely cross-domain questions. They replace siloed search
pipelines with a coherent mechanism that respects semantic relationships across channels, ultimately
delivering faster, more accurate and more user-centric AI solutions in domains that range from
e-commerce and digital libraries to medical diagnostics and autonomous vehicles.</p>
    </sec>
    <sec id="sec-7">
      <title>7. High level architecture and designing RAG system</title>
      <sec id="sec-7-1">
        <title>7.1. User Interface</title>
        <p>The Retrieval-Augmented Generation (RAG) System is designed to seamlessly integrate real-time
data retrieval with powerful language modeling capabilities, enabling the production of contextually
relevant and multimodal responses for a variety of domains. At its core, this system combines a
vector database for storing and retrieving multimodal embeddings (text, images, etc.) with a Large
Language Model (LLM) to enrich user queries with up-to-date and domain-specific information.</p>
        <p>Much like an integrated platform in an educational environment, the RAG System functions as a
robust, scalable solution that prioritizes flexibility, accuracy, and simplicity for both administrators
and end users. By leveraging modern AI techniques, the system can handle diverse data (e.g., PDFs,
images, tables) and deliver intelligent responses tailored to the user’s query. This architecture focuses
on real-time retrieval, transparent communication with AWS-managed services, and a
well-organized API layer that supports synchronous or asynchronous usage.</p>
        <p>
          The user-facing component of the RAG System is a lightweight FastAPI application that offers a
straightforward, intuitive interface to submit and manage queries. Administrators and advanced
users can interact with the system through various endpoints – either synchronously or by
delegating tasks to a worker Lambda function for asynchronous processing [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. This design not only
improves accessibility and ease of management but also enhances the user experience by
automatically distributing workload for resource-intensive queries (Fig 1).
        </p>
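        <p>A trimmed-down sketch of such an endpoint is given below; the request model, field names, and the commented helper calls are illustrative stand-ins rather than the exact production code:</p>
        <preformat>
import time
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SubmitQueryRequest(BaseModel):
    query_text: str

@app.post("/submit_query")
def submit_query(request: SubmitQueryRequest):
    # Register the query (fields mirror the QueryModel described later in the
    # text), then hand the heavy work to the worker Lambda and return at once.
    record = {
        "query_id": uuid.uuid4().hex,
        "query_text": request.query_text,
        "create_time": int(time.time()),
        "status": "in progress",
    }
    # put_query_record(record); delegate_to_worker(record)  # illustrative helpers
    return {"query_id": record["query_id"], "status": record["status"]}

@app.get("/get_query")
def get_query(query_id: str):
    # Would look the stored record (and answer, once complete) up in DynamoDB.
    ...
        </preformat>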
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Backend and Core Logic</title>
        <p>At the heart of the system is the Python-based backend, which coordinates the entire
Retrieval-Augmented Generation workflow by interfacing with multiple specialized modules. The process
begins with document ingestion, where PDFs, images, and table data are harvested, divided into
smaller chunks, and then embedded using a Bedrock-based model specifically tasked with generating
vector representations. During this phase, the system extracts textual content, table-like structures,
and any relevant images from PDF sources, relying on carefully designed chunk-splitting logic to
ensure each portion remains within manageable size limits. These chunks, once converted into
floating-point vectors, are written to the Chroma database, a vector store designed for rapid
similarity searches. By continually updating Chroma DB in this fashion, the system ensures that any
query can access the most recent semantic representations, thus reducing the risk of outdated
references and enabling context-rich retrieval.</p>
        <p>Once a user query arrives, the backend draws upon the LLM integration layer, connecting to a
large language model (such as Anthropic Claude) through AWS Bedrock. Before the model is called,
the backend retrieves top-k relevant embeddings from Chroma DB by comparing the user’s query
embedding to the stored vectors; these retrieved chunks are then merged into a context prompt
designed to guide the LLM toward an answer that remains faithful to the underlying data. The goal
here is twofold: to minimize hallucinations and to maintain high fidelity to the documents ingested
earlier. By injecting these top-k chunks directly into the prompt, the LLM receives immediate,
domain-specific knowledge that significantly boosts the coherence and accuracy of the generated
text.</p>
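        <p>The prompt construction and model invocation can be sketched as follows; the Claude model identifier and the prompt wording are examples, and error handling is omitted for brevity:</p>
        <preformat>
import json

import boto3

# Sketch of grounding Claude in the retrieved chunks via Amazon Bedrock.
bedrock = boto3.client("bedrock-runtime")

def generate_answer(question: str, context_chunks: list[str]) -> str:
    prompt = (
        "Use only the following context to answer.\n\n"
        + "\n---\n".join(context_chunks)
        + "\n\nQuestion: " + question
    )
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example identifier
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
        </preformat>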
        <p>To keep track of every incoming request and subsequent answer, the backend leverages a
DynamoDB table that logs essential query details, including query text, source metadata, and the
model’s generated response. This design ensures the system can handle a spike in usage by
distributing state management across a serverless, NoSQL data store rather than relying on a single
node or a traditional relational database. Scalability becomes straightforward under high-traffic
scenarios, as DynamoDB scales seamlessly when writes and reads intensify, and each query record is
associated with a unique key to allow instantaneous lookups.</p>
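        <p>A minimal boto3 sketch of this record keeping is given below; the table name is a placeholder, and the attribute set follows the fields named in the text (query_id, query_text, create_time, status):</p>
        <preformat>
import time

import boto3

# Illustrative access to the query table; the table name is an assumption.
table = boto3.resource("dynamodb").Table("rag_queries")

def put_query_record(query_id: str, query_text: str) -> None:
    table.put_item(Item={
        "query_id": query_id,              # partition key
        "query_text": query_text,
        "create_time": int(time.time()),
        "status": "in progress",
    })

def get_query_record(query_id: str) -> dict:
    # Instantaneous lookup by the unique query key.
    return table.get_item(Key={"query_id": query_id}).get("Item", {})
        </preformat>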
        <p>For operations that demand significant processing time, the backend employs an asynchronous
architecture. When a query is likely to involve substantial parsing, large volumes of data, or extensive
generation time, the system transfers control to a worker Lambda function. This worker remains
responsible for finalizing RAG tasks – retrieving embeddings from Chroma DB, constructing the
LLM prompt, generating an answer, and recording the output back in DynamoDB.</p>
        <p>Because this worker executes independently of the front-end API, the user-facing Lambda returns
an immediate acknowledgment, thus sparing the user from lengthy waits and preventing potential
timeouts. By separating heavy-duty tasks from synchronous API calls, the backend remains
responsive even under large loads, and errors in one long-running query do not impede others (Fig 2).</p>
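        <p>The hand-off itself can be reduced to a single asynchronous ("Event") Lambda invocation, sketched below with an illustrative worker function name:</p>
        <preformat>
import json

import boto3

# Fire-and-forget delegation from the API function to the worker function;
# the worker name would normally come from configuration.
lambda_client = boto3.client("lambda")

def delegate_to_worker(query_id: str, query_text: str) -> None:
    lambda_client.invoke(
        FunctionName="rag-worker",          # illustrative name
        InvocationType="Event",             # asynchronous: no waiting on the result
        Payload=json.dumps({"query_id": query_id, "query_text": query_text}),
    )
        </preformat>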
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Deployment and Infrastructure</title>
        <p>
          The RAG System is containerized and deployed via AWS CDK (Cloud Development Kit) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
Through CDK, core components are defined as code, making them easy to maintain, version-control,
and replicate:
        </p>
        <p>• Lambda Functions: Two Docker-based Lambda functions – one for API handling (FastAPI)
and one for background (worker) tasks.</p>
        <p>• DynamoDB: Houses query metadata, enabling quick lookups of query status and results.</p>
        <p>• Chroma Vector DB: Maintains vector embeddings for text, tables, and images.</p>
        <p>• Amazon Bedrock: Provides the language model (Claude) and embedding services.</p>
        <p>• IAM Roles: Enforce least-privilege access to ensure a secure environment.</p>
        <p>Through this setup, the RAG System provides an adaptable, cloud-native platform for real-time information retrieval, content generation, and robust data storage. Administrators can easily configure environment variables (e.g., DynamoDB table names, Lambda handlers) and scale individual components to match evolving demands, ensuring efficiency and reliability in various real-world scenarios.</p>
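        <p>An abbreviated CDK sketch of such a stack is shown below; the construct identifiers, image asset paths, and environment variable names are illustrative placeholders rather than the project’s exact values:</p>
        <preformat>
from aws_cdk import Stack, aws_dynamodb as dynamodb, aws_lambda as _lambda
from constructs import Construct

class RagStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Query metadata table keyed by query_id.
        table = dynamodb.Table(
            self, "QueryTable",
            partition_key=dynamodb.Attribute(name="query_id",
                                             type=dynamodb.AttributeType.STRING),
            billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
        )

        # Two Docker-based Lambda functions: API handling and background work.
        api_fn = _lambda.DockerImageFunction(
            self, "ApiFunction",
            code=_lambda.DockerImageCode.from_image_asset("image/api"),
            environment={"TABLE_NAME": table.table_name},
        )
        worker_fn = _lambda.DockerImageFunction(
            self, "WorkerFunction",
            code=_lambda.DockerImageCode.from_image_asset("image/worker"),
            environment={"TABLE_NAME": table.table_name},
        )

        # Least-privilege grants instead of broad IAM policies.
        table.grant_read_write_data(api_fn)
        table.grant_read_write_data(worker_fn)
        worker_fn.grant_invoke(api_fn)
        </preformat>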
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Data Ingestion and Workflow</title>
        <p>Modern organizations often deal with a variety of data types – text, images, tables, etc. – within a
single PDF document. Our RAG pipeline addresses this challenge by breaking each file into
appropriately embedded chunks before storing them in Chroma DB. Below is an overview of this
ingestion flow:
1. File Detection and Reading. The system checks a source directory (e.g., src/data/source/) for
any PDF files. Each file discovered is opened by PyMuPDF (fitz), which iterates through its
pages.
2. Chunk Splitting and Metadata Tagging. After extraction, text-based content is split into
smaller parts to ensure each segment remains within suitable lengths for embedding. Throughout this
process, metadata such as file source, page number, and content type (text, image, or table) is
retained. A unique identifier is assigned so each chunk can be traced back to its PDF origin.</p>
        <p>3. Vector Embedding. Each chunk (text or Base64-encoded image) is passed to a Bedrock-based
embedding function – defined in get_embedding_function.py – which converts it into a
floating-point vector. The pipeline can quickly handle semantic searches by storing both the
chunk and its vector.</p>
        <p>4. Database Persistence. Finally, the newly created or modified chunks, along with their
embeddings, enter Chroma DB. Before adding each chunk, the system checks its unique ID to
avoid duplicates. With this ingestion complete, the system is primed for RAG workflows,
ensuring that text, images, and tables are immediately accessible for similarity searching and
context generation.</p>
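        <p>Steps 1 and 2 of this flow can be sketched with PyMuPDF and a simple fixed-size splitter, as below; the chunk size, overlap, and identifier format are illustrative choices rather than the exact production settings:</p>
        <preformat>
import fitz  # PyMuPDF

# Ingestion sketch: read each page, split its text into overlapping pieces,
# and attach the metadata that later allows traceability back to the source.
def extract_chunks(pdf_path: str, chunk_size: int = 1000, overlap: int = 100):
    chunks = []
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc):
        text = page.get_text()
        for start in range(0, len(text), chunk_size - overlap):
            piece = text[start:start + chunk_size]
            chunks.append({
                "page_content": piece,
                "metadata": {
                    "source": pdf_path,
                    "page": page_number,
                    "type": "text",
                    "id": f"{pdf_path}:{page_number}:text:{start}",
                },
            })
    doc.close()
    return chunks
        </preformat>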
        <sec id="sec-7-4-1">
          <title>Key points:</title>
          <p>Comprehensive Extraction: Captures text, images, and basic table data in one pass.
Automatic Chunking: Smaller text pieces improve both retrieval and model context.
Consistent Metadata: Each extracted piece is coupled with source info, enabling accurate
traceability (Fig 3).</p>
          <p>
            Once the PDF ingestion and embedding steps are complete, users can query the system to retrieve
relevant information. Below is the typical flow that starts with a user request and concludes with a
context-enhanced answer:
1. User Submits Query. The user sends a query (e.g., “Show me info about your services”) to the
FastAPI endpoint. A QueryModel object is created, storing essential fields like query_id,
query_text, and create_time, and is immediately placed into DynamoDB with a status of “in
progress.”
2. Retrieval Step (Vector Database Lookup). The query text is embedded using the Bedrock [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]
embeddings model, then a similarity search is performed in Chroma DB to locate the top-k
relevant chunks (text, tables, images). These chunks might include PDF excerpts,
base64-encoded images, or table data pertinent to the user’s query.
3. Prompt Assembly and Model Invocation. The retrieved chunks are grouped by type (text,
table, image) and combined with the user’s question to form a prompt template. This prompt
is sent to the LLM (Anthropic Claude or a similar model via AWS Bedrock), which generates a
context-aware answer.</p>
          <p>4. Response Creation and Storage. The model’s response is added to the QueryModel record in
DynamoDB, along with IDs of the documents that were used as context. The query’s status is
updated to “complete,” indicating that a valid answer is ready.</p>
          <p>5. User Retrieval of Answer. The user or client application can retrieve the completed answer by
querying DynamoDB or calling the FastAPI endpoint (GET /get_query?query_id=...). The
returned data includes the AI-generated answer plus any relevant source metadata.</p>
          <p>One additional idea for this project, not yet implemented due to limited time, involves a separate
caching service that would deliver results for very similar queries directly from a database. This could
lessen the load on the LLM but presents its own challenges.</p>
          <p>The RAG System operates with two primary data storage mechanisms, each fulfilling distinct
roles:</p>
          <p>Chroma Vector Database. Stores high-dimensional embeddings derived from text, images,
and tables for rapid similarity search. Facilitates real-time retrieval of context relevant to user
queries.</p>
          <p>DynamoDB (Metadata and Query State Management). Houses metadata such as query states,
final answers, and references to relevant source chunks.</p>
          <p>Ensures persistence and fast lookups for user queries, enabling both synchronous and
asynchronous processing.</p>
          <p>By separating vector storage from query and state metadata, the system can efficiently handle
large-scale retrieval tasks while maintaining a minimal, consistent record of query progress and
results.</p>
          <p>Chroma DB is designed specifically for storing and searching large sets of embeddings
(high-dimensional vectors). In the RAG System, every chunk of extracted text, table data, or
base64-encoded image content is converted into vectors using a Bedrock-based embedding function. These
vectors are then persisted in a Chroma index to enable efficient Approximate Nearest Neighbor
(ANN) lookups.</p>
          <p>Each chunk inserted into Chroma DB typically consists of:</p>
          <p>Embedding Vector: A numeric array representing the semantic content of the chunk.
Page Content: Original extracted text or encoded media (e.g., base64-encoded image).
Metadata Dictionary: Key-value pairs capturing:
source: Filename or document source identifier (e.g., my_document.pdf).
page: Page number or location reference in the original file.
type: Classification of the chunk (e.g., text, table, image).
id: Unique identifier for the chunk (e.g., a hash of the chunk’s content).</p>
          <p>Example (Conceptual):</p>
          <preformat>
{
  "embedding": [0.234, -0.091, ..., 0.778],
  "page_content": "Base64 Encoded Data or Extracted Text",
  "metadata": {
    "source": "data/source/document.pdf",
    "page": 5,
    "type": "image",
    "id": "my_document.pdf:5:image:123456789"
  }
}
          </preformat>
          <p>Although Chroma DB doesn’t enforce a rigid “schema” in the same way a relational database does,
it provides a structured approach to indexing embeddings. By default, it maintains:
ID Field: A unique document identifier in the vector store.
Embedding Field: Stores the vector representation for each document chunk.
Metadata Field: An opaque JSON-like structure for any additional data relevant to retrieval.</p>
          <p>While Chroma DB is optimized for vector similarity search, DynamoDB (NoSQL) is used to track
user queries, record AI-generated answers, and maintain the overall workflow state. This separation
ensures that the system can quickly serve query metadata – such as whether a query is complete –
without scanning a large embedding store. The primary key is a query_id, representing a unique user
query or session (Table 1).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <table-wrap id="table1">
        <label>Table 1</label>
        <caption>
          <p>Attributes of the query record stored in DynamoDB.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Attribute</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>query_id</td>
              <td>Primary key for the query.</td>
            </tr>
            <tr>
              <td>create_time</td>
              <td>Unix timestamp indicating when the query was created.</td>
            </tr>
            <tr>
              <td>query_text</td>
              <td>Original user input text.</td>
            </tr>
            <tr>
              <td/>
              <td>AI-generated answer.</td>
            </tr>
            <tr>
              <td/>
              <td>List of relevant chunk IDs from Chroma DB.</td>
            </tr>
            <tr>
              <td/>
              <td>Indicates whether query processing is finished (true/false).</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
        <p>This study presents an asynchronous, web-based RAG system that integrates multimodal inputs to
enhance information retrieval and generation in various contexts. The system is developed in Python
and hosted on AWS, combining Chroma DB for vector storage with Anthropic’s Claude-3-Haiku model,
accessed via AWS Bedrock. By leveraging modern cloud capabilities, the solution scales
efficiently and handles diverse data modalities in real time.</p>
        <p>The challenges confronted during the development of this RAG System—ranging from LLM
adaptation and vector indexing intricacies to robust asynchronous orchestration—have yielded
significant insights. These lessons, rooted in both practical experimentation and a deep personal
engagement with the topic, shape a roadmap for further enhancements. Ultimately, the confluence of
Large Language Models, intelligent retrieval, and scalable cloud infrastructure heralds a new frontier
in AI applications, one where factual grounding, multimodal understanding, and on-demand
adaptability converge to transform how information is accessed and consumed.</p>
        <p>In sum, the project underscored the necessity of interdisciplinary thinking— combining the rigor
of software engineering, the depth of modern NLP research, and the architectural elegance of
serverless cloud design. Our eagerness to expand upon these foundations is unwavering, as each
challenge fosters a more robust, visionary perspective on the role of AI in orchestrating dynamic,
context-aware intelligence across diverse domains.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Citations and bibliographies</title>
      <p>
        Informally published works [
        <xref ref-type="bibr" rid="ref1 ref11 ref14 ref15 ref16 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 14, 15, 16</xref>
        ].
      </p>
      <p>
        Online documents and World Wide Web resources [
        <xref ref-type="bibr" rid="ref10 ref12 ref13">10, 12, 13</xref>
        ].
      </p>
      <p>During the preparation of this work, the author used ChatGPT-4 in order to: check grammar and
verify factual accuracy of gathered information. After using these tools/services, the author reviewed
and edited the content as needed and takes full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al. (
          <year>2020</year>
          ).
          <article-title>Language Models are Few-Shot Learners</article-title>
          . arXiv:2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al. (
          <year>2021</year>
          ).
          <article-title>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</article-title>
          . arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Mialon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , et al. (
          <year>2023</year>
          ).
          <article-title>Augmenting LLMs with External Knowledge: A Survey of Retrieval-Augmented Generation (RAG)</article-title>
          . arXiv:2306.13931.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Karpukhin</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oguz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , et al. (
          <year>2020</year>
          ).
          <article-title>Dense Passage Retrieval for Open-Domain Question Answering</article-title>
          . arXiv:2004.04906.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          , et al. (
          <year>2021</year>
          ).
          <article-title>Learning Transferable Visual Models from Natural Language Supervision (CLIP)</article-title>
          . arXiv:2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al. (
          <year>2022</year>
          ).
          <article-title>BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</article-title>
          . arXiv:2201.12086.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          . arXiv:1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al. (
          <year>2021</year>
          ).
          <article-title>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</article-title>
          . arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>OpenAI.</surname>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>GPT-4 Technical Report</article-title>
          . arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Anthropic. Claude Documentation. Retrieved from: https://docs.anthropic.com/en/docs/welcome
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al. (
          <year>2019</year>
          ).
          <article-title>Billion-scale similarity search with GPUs</article-title>
          .
          <source>IEEE Transactions on Big Data. arXiv:1702.08734</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] FastAPI https://fastapi.tiangolo.com/</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] AWS. Bedrock AI Service Documentation. Retrieved from: https://aws.amazon.com/documentation-overview/bedrock/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] AWS. CDK Documentation. Retrieved from: https://aws.amazon.com/cdk/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
          </string-name>
          , Ł., &amp;
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Attention is all you need</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <volume>30</volume>
          . https://arxiv.org/abs/1706.03762
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al. (
          <year>2020</year>
          ).
          <article-title>Transformers: State-of-the-Art Natural Language Processing</article-title>
          . arXiv:1910.03771.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>