<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Retrieval-Augmented Generation (RAG) Pipeline for GI-Cancer Prediction and Classification Using Quantized Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arti Jha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nidhi Shah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikas Kumawat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <email>yash@pilani.bits-pilani.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Systems, Birla Institute of Technology and Science</institution>
          ,
          <addr-line>Pilani</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>12</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents the implementation of a cancer detection chatbot utilizing a Retrieval-Augmented Generation (RAG) pipeline integrated with Large Language Models (LLMs). The system aims to improve the accuracy and relevance of cancer prediction and classification by leveraging comprehensive text-based medical data related to Gastrointestinal (GI) Cancer, including symptoms categorized by age, gender, stage, and type. The chatbot uses MiniLM L6 v2 to generate text embeddings, which are stored and retrieved using FAISS as the vector database, and GPT-3.5 turbo for query response generation. A comprehensive comparison of the quantized and non-quantized models (Bio-Mistral and GPT-3.5 turbo) for response generation is presented. The architecture, methodologies, and evaluation metrics used to assess the chatbot's performance are discussed alongside a literature review highlighting advancements in RAG and LLM applications in healthcare, emphasizing this work's significance in cancer diagnosis.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
        <kwd>Cancer Prediction</kwd>
        <kwd>Quantized Large Language Models</kwd>
        <kwd>Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The integration of artificial intelligence (AI) into medical diagnostics, particularly for cancer prediction
and classification, has the potential to revolutionize healthcare delivery. Cancer diagnosis traditionally
involves the analysis of diverse text-based data such as clinical notes and patient histories. This paper
presents a Retrieval-Augmented Generation (RAG) framework combined with Large Language Models
(LLMs) to develop a sophisticated chatbot aimed at assisting clinicians in cancer diagnosis. The chatbot
focuses on processing and synthesizing text-based medical data to enhance diagnostic accuracy and
relevance.</p>
      <p>Contextual Understanding and Generation: Retrieved data is fed into the LLM (e.g., GPT-3.5 turbo),
which synthesizes the information into accurate, contextually relevant diagnostic outputs. The LLM
leverages its transformer architecture to understand relationships between different data inputs and
generate detailed diagnostic reports.</p>
      <p>Real-Time Decision Support: The chatbot serves as a real-time decision support tool, continuously
updating its knowledge base with the latest medical research, thus aligning its recommendations with
current standards of care.</p>
      <p>Clinical Efficiency and Personalization: By automating data retrieval and synthesis, the chatbot
reduces the cognitive load on clinicians, enhancing diagnostic efficiency. The system also personalizes
responses based on specific patient data and clinician preferences, ensuring relevance and applicability.</p>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <sec id="sec-1-1">
        <title>1.1. Background and Motivation</title>
        <p>Recent advancements in natural language processing (NLP) and AI have enabled the development of
sophisticated medical chatbots capable of supporting healthcare professionals in various capacities. The
COVID-19 pandemic has accelerated the adoption of AI-driven solutions, particularly in contexts where
rapid dissemination of information is critical. The success of these AI systems in managing infectious
diseases has prompted further exploration into their applicability in other medical domains, such as
oncology. The challenge, however, lies in the effective integration of diverse data sources into a unified
diagnostic framework. This paper addresses this challenge by presenting a RAG-based approach, shown
in figure 1, that leverages the power of LLMs to enhance cancer prediction and classification.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The work discusses the development and evaluation of GastroBot [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a chatbot [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] designed for
gastrointestinal disease inquiries, utilizing Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] technology to
enhance response accuracy and relevance. The study reports high System Usability Scale (SUS) scores for
GastroBot, indicating superior safety, usability, and smoothness compared to other models. It highlights
the importance of integrating external knowledge sources into models such as ChatGPT to address challenges in
clinical applications of large language models. A significant amount of prior research has focused on
developing architectures that enhance systems with non-parametric memory, which are trained from the
ground up for particular tasks, such as memory networks, stack-augmented networks, and memory layers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In contrast, this approach investigates a scenario where both parametric and non-parametric memory
components are pre-trained and pre-loaded with extensive knowledge. Importantly, by utilizing
pretrained access mechanisms, the system can access this knowledge without requiring additional training.
This work presents Retrieval Under Graph-Guided Explainable disease Distinction (RUGGED) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a
computational workflow that integrates Large Language Models with Retrieval Augmented Generation
to enhance biomedical hypothesis generation. It utilizes a comprehensive knowledge graph enriched
with data from various biomedical sources to provide explainable and actionable predictions. The
system helps researchers explore complex biomedical questions effectively and provides specific
answers to them.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed cancer detection chatbot is built upon a sophisticated RAG pipeline integrated with LLMs,
designed to process and synthesize text-based medical data into coherent diagnostic insights.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Collection and Preprocessing</title>
        <p>The dataset used in this study includes comprehensive text-based records from Electronic Health
Records (EHRs) related to Gastrointestinal (GI) Cancer. The data collection process involved sourcing
relevant medical records, ensuring the inclusion of diverse and representative samples of cancer
cases, and focusing on text data that describe symptoms, diagnoses, and patient histories. The data
collection process involved sourcing relevant medical records from established databases and ensuring
the inclusion of diverse and representative samples.</p>
        <p>Each dataset undergoes a rigorous preprocessing phase tailored to its specific data type. EHRs
are tokenized and parsed to extract key medical concepts, with particular attention given to patient
history, symptoms, and previous diagnoses. Imaging data is preprocessed using standard normalization
techniques and segmented into regions of interest, focusing on areas indicative of cancerous growths.
Genomic data is processed to highlight relevant biomarkers and genetic mutations, which are then
encoded into vectors for efficient retrieval.</p>
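        <p>As a minimal illustration of this preprocessing step, splitting record text into overlapping windows before embedding can be sketched as follows. The chunk size, overlap, and splitting strategy are assumptions for illustration; the paper does not specify its exact configuration.</p>

```python
# Hypothetical sketch of chunking EHR text prior to embedding.
# Chunk size and overlap are illustrative assumptions, not the
# authors' actual configuration.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reaches the end of the text
        start += step
    return chunks

record = "Patient reports persistent epigastric pain and weight loss. " * 15
chunks = chunk_text(record)
```

Overlapping windows keep symptom mentions that straddle a chunk boundary retrievable from at least one chunk.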
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Retrieval Mechanisms</title>
        <p>This work utilizes dense retrieval for the text-based dataset, using MiniLM L6 v2, a compact
transformer model distilled from larger BERT-style models and optimized for text embedding generation
and sentence similarity tasks. The embeddings are stored and retrieved using the FAISS vector database,
which facilitates efficient and accurate retrieval of relevant information in response to user queries.</p>
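        <p>The retrieval step can be sketched with a toy stand-in for the embedding model and index: hashed bag-of-words vectors and exhaustive cosine search replace MiniLM L6 v2 and FAISS here. Everything in the snippet is an illustrative assumption, not the authors' code.</p>

```python
import math
import zlib
from collections import Counter

def embed(text: str, dim: int = 4096) -> list[float]:
    """Toy stand-in for MiniLM L6 v2: a normalized hashed bag-of-words
    vector (the real pipeline uses learned sentence embeddings)."""
    vec = [0.0] * dim
    for tok, cnt in Counter(text.lower().split()).items():
        vec[zlib.crc32(tok.encode()) % dim] += cnt
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Exhaustive inner-product search over normalized vectors,
    analogous to a flat FAISS index with cosine similarity."""
    q = embed(query)
    return sorted(docs, key=lambda d: -sum(a * b for a, b in zip(q, embed(d))))[:k]

docs = [
    "persistent abdominal pain and unexplained weight loss",
    "mild seasonal allergies with sneezing",
    "blood in stool and chronic abdominal discomfort",
]
top = retrieve("abdominal pain with blood in stool", docs)
```

In the described system the same two operations, embed and nearest-neighbor search, are performed by the sentence-embedding model and the FAISS index respectively.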
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Integration with Large Language Models (LLMs)</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Quantized LLM</title>
          <p>The conversational chatbot is initially powered by the Bio-Mistral Quantized LLM, a roughly 5 GB
model built on the Mistral 7B base model and adapted for the biomedical domain. This quantized model facilitates enhanced extensibility and interoperability
across diverse computing environments and versions, making it feasible to deploy on low-power CPUs
and GPUs. Although the Bio-Mistral model demonstrated significant computational efficiency and
rapid response times, its accuracy was suboptimal. Consequently, to improve the accuracy of responses,
the system was subsequently upgraded to utilize the non-quantized GPT-3.5 turbo LLM, which provides
superior precision in natural language generation and processing.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Non-Quantized LLM</title>
          <p>Once the relevant text data is retrieved using FAISS, it is passed to GPT-3.5 turbo, which generates the
final diagnostic outputs. GPT-3.5 turbo synthesizes the retrieved data into accurate, contextually relevant
diagnostic insights, leveraging its advanced language understanding capabilities. The transformer
layers within the LLM capture the intricate relationships between the various data inputs and the query
context, enabling the generation of diagnostic reports that are both accurate and aligned with clinical
best practices.</p>
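          <p>A hedged sketch of this generation step is given below: the retrieved chunks and the user query are assembled into a single prompt for the LLM. The prompt template and the chunk-citation convention are illustrative assumptions, and the actual API call to GPT-3.5 turbo is omitted.</p>

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks and the user query into one LLM prompt.
    Template wording is an assumption for illustration."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (
        "You are a clinical decision-support assistant for GI cancer.\n"
        "Answer using ONLY the context below and cite chunk numbers.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

chunks = [
    "Early-stage gastric cancer often presents with vague epigastric pain.",
    "Colorectal cancer may present with rectal bleeding and anemia.",
]
prompt = build_prompt("Which symptoms suggest colorectal cancer?", chunks)
```

Grounding the answer in numbered context chunks is one common way to make the model's output traceable back to the retrieved evidence.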
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Methodology</title>
      <p>To rigorously evaluate the performance of the RAG-based chatbot in cancer prediction and classification,
we implemented a comprehensive set of experiments designed to assess the impact of different retrieval
strategies on key diagnostic metrics. The experiments were conducted on a meticulously curated
text-based dataset comprising patient symptoms and corresponding cancer diagnoses sourced from
Electronic Health Records (EHRs). This dataset provides a representative sample of GI Cancer cases,
enabling the chatbot to match symptoms with potential cancer diagnoses. The dataset comprises over
500 symptoms of cancer cases for diagnosis. The chatbot is designed to interact with users by asking
for symptoms and then responding with diagnostic insights or recommendations based on the retrieved
data. This dataset was used for fine-tuning, whereas the dataset provided for the FIRE
2024 was used to evaluate the model with different LLMs. The testing dataset consisted of over 50
questions about symptoms related to GI cancer.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>The dataset, consisting of patient symptoms and corresponding medical records, was divided into
training and testing subsets, with 80% allocated for training and 20% reserved for testing, and table 2
shows the relevance scores of different models based on test data. This division ensured that the testing
data remained unseen during the training phase, allowing for an objective evaluation of the model’s
generalization capabilities. Cross-validation was performed across different data subsets to further validate
the robustness of the results.</p>
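        <p>The 80/20 split described above can be sketched as follows; the shuffle seed and record format are assumptions, and only the proportions come from the text.</p>

```python
import random

def train_test_split(records: list, test_frac: float = 0.2, seed: int = 42):
    """Shuffle and split records 80/20, per the setup described above."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

records = [f"case-{i}" for i in range(500)]  # stand-in for the EHR records
train, test = train_test_split(records)
```

Shuffling before splitting guards against ordering effects, e.g. records grouped by cancer type in the source file.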
        <p>The relevant data chunks are first extracted from the dataset using a text splitter. The retrieved data was
then processed by MiniLM to generate embeddings, which were stored in the FAISS vector database.
The query from the user then gets converted to embeddings and is compared with data chunks to find
relevant chunks from the database. Both query and relevant chunks are then sent to the LLM, which
generates diagnostic outputs or recommendations that are subsequently evaluated against established
ground truth labels; refer to figure 2. This experimental setup enabled a thorough comparison of
each LLM’s response (BioMistral and GPT), focusing specifically on how well the chatbot could match
symptoms to potential cancer diagnoses.</p>
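        <p>The comparison against ground-truth labels can be sketched as a simple accuracy computation. The matching criterion used here, exact label match, is an assumption for illustration; the paper also reports rater-based and similarity metrics.</p>

```python
def accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of chatbot diagnoses that match the ground-truth label."""
    assert len(predictions) == len(ground_truth)
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, ground_truth))
    return hits / len(ground_truth)

# Hypothetical model outputs vs. labels for three test questions:
preds = ["gastric cancer", "colorectal cancer", "esophageal cancer"]
truth = ["gastric cancer", "colorectal cancer", "gastric cancer"]
acc = accuracy(preds, truth)  # 2 of 3 correct
```

The same loop, run once per LLM over the held-out questions, yields the per-model scores compared in the following subsections.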
        <p>The experimental results revealed substantial insights into the effectiveness of the RAG-based chatbot
in cancer prediction and classification, particularly in its ability to match symptoms provided by the
user to potential cancer diagnoses and provide appropriate recommendations.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Performance Comparison between Bio-Mistral Quantized LLM and GPT-3.5 turbo</title>
        <p>Table 1 compares the performance metrics of the Bio-Mistral Quantized LLM and GPT-3.5 turbo,
highlighting significant improvements in key areas. GPT-3.5 turbo demonstrates a 10.15% increase
in accuracy, showing its superior precision over Bio-Mistral. Additionally, GPT-3.5 turbo reduces the
hallucination rate by 4.76%, indicating fewer instances of incorrect information generation. The missing
rate, which refers to the omission of relevant content, is lowered by 3.43%. In terms of alignment,
GPT-3.5 turbo achieves a 3.6% improvement in AlignScore, suggesting better consistency with expected
outputs. The model also enhances semantic understanding, as reflected by a 10.01% improvement in
semantic similarity. Lastly, GPT-3.5 turbo significantly reduces AI-generated errors by 20.17%, further
underscoring its enhanced reliability compared to the Bio-Mistral Quantized LLM; refer to table 1.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Change in each metric for GPT-3.5 turbo relative to the Bio-Mistral Quantized LLM.</p></caption>
          <table>
            <thead>
              <tr><th>Metric</th><th>Change (GPT-3.5 turbo vs. Bio-Mistral)</th></tr>
            </thead>
            <tbody>
              <tr><td>Accuracy</td><td>+10.15%</td></tr>
              <tr><td>Hallucination Rate</td><td>-4.76%</td></tr>
              <tr><td>Missing Rate</td><td>-3.43%</td></tr>
              <tr><td>AlignScore</td><td>+3.6%</td></tr>
              <tr><td>Semantic Similarity</td><td>+10.01%</td></tr>
              <tr><td>AI-generated errors</td><td>-20.17%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Relevance of Generated Outputs</title>
        <p>The relevance of the outputs generated by GPT-3.5 turbo (non-quantized) significantly outperformed
those produced by the Bio-Mistral (quantized) model; refer to table 2. Healthcare professionals rated
GPT-3.5 turbo’s relevance score at an average of 9.0 out of 10, compared to Bio-Mistral’s 7.6 out of 10.
This indicates that GPT-3.5 turbo’s diagnostic suggestions were more closely aligned with clinical best
practices and provided more actionable insights for patient care, especially in responding to symptoms
described by users, showcasing the superiority of GPT-3.5 turbo in generating contextually relevant
outputs.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The experimental results presented in this paper demonstrate the effectiveness of a Retrieval-Augmented
Generation (RAG) pipeline integrated with Large Language Models (LLMs) for cancer prediction and
classification, specifically for Gastrointestinal (GI) cancers. The dense-retrieval RAG approach
significantly enhances the diagnostic accuracy and relevance of the chatbot. By leveraging
comprehensive text-based medical data, the system
improves the alignment of generated diagnostic insights with clinical best practices, as shown by the
substantial improvements in metrics such as accuracy, semantic similarity, and AI-generated errors
when using GPT-3.5 turbo compared to the Bio-Mistral Quantized LLM.</p>
      <p>Moreover, the relevance of the generated outputs was rated significantly higher when utilizing
GPT-3.5 turbo, further emphasizing the value of integrating advanced LLMs in medical diagnosis.
These improvements highlight the chatbot’s potential as a reliable decision-support tool for healthcare
professionals, capable of synthesizing patient symptoms with medical data to offer actionable insights.</p>
      <p>Future work will focus on refining the retrieval mechanisms and incorporating real-time data
integration, with the aim of developing a fully integrated decision-support system to further support
cancer diagnosis and treatment. The promising results from this study underscore the importance of
combining advanced retrieval techniques with powerful language models to improve the accuracy and
clinical relevance of diagnostic tools in the healthcare domain.</p>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly for grammar and
spelling checks, as well as for paraphrasing and rewording. The author(s) also experimented with
GPT-3.5 and a quantized version of BioMistral as part of the model development process. All content
was reviewed and edited by the author(s), who take full responsibility for the final content of this
publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Q. Zhou, C. Liu, Y. Duan, K. Sun, Y. Li, H. Kan, Z. Gu, J. Shu, J. Hu, GastroBot: a Chinese gastrointestinal disease chatbot based on the retrieval-augmented generation, Frontiers in Medicine 11 (2024) 1392555. doi:10.3389/fmed.2024.1392555.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023). doi:10.48550/arXiv.2303.18223.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Jeong, J. Sohn, M. Sung, J. Kang, Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models, arXiv preprint arXiv:2401.15269 (2024). doi:10.48550/arXiv.2401.15269.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al., Natural questions: a benchmark for question answering research, Transactions of the Association for Computational Linguistics 7 (2019) 453-466. doi:10.1162/tacl_a_00276.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. R. Pelletier, J. Ramirez, I. Adam, S. Sankar, Y. Yan, D. Wang, D. Steinecke, W. Wang, P. Ping, Explainable biomedical hypothesis generation via retrieval augmented generation enabled large language models, arXiv preprint arXiv:2407.12888 (2024). doi:10.48550/arXiv.2407.12888.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>