Enhancing Biomedical Question Answering with Parameter-Efficient Fine-Tuning and Hierarchical Retrieval Augmented Generation

Yichen Gao1,†, Licheng Zong1,† and Yu Li1,2,*
1 The Chinese University of Hong Kong, Hong Kong SAR, 999077, China
2 The CUHK Shenzhen Research Institute, Hi-Tech Park, Shenzhen, 518057, China

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
ycgao23@cse.cuhk.edu.hk (Y. Gao); lczong21@cse.cuhk.edu.hk (L. Zong); liyu@cse.cuhk.edu.hk (Y. Li)
https://lczong.com/ (L. Zong); https://liyu95.com/ (Y. Li)
ORCID: 0000-0001-5418-005X (L. Zong); 0000-0002-3664-6722 (Y. Li)

Abstract
This paper reports the work done by the CUHK-AIH team in the 12th BioASQ Challenge task 12b, which involves Phases A, A+, and B. In Phase A, we build BM25 indexes for all documents from PubMed Central (PMC). When an input question is received, the system uses the question as a keyword query to retrieve relevant documents from PMC via the BM25 retriever, obtaining a list of targeted documents. For Phase A+, we construct a hierarchical Retrieval-Augmented Generation (RAG) pipeline based on the Llama2-7B-chat model. The model is fine-tuned on the BioASQ training set using a Parameter-Efficient Fine-Tuning (PEFT) method called Low-Rank Adaptation (LoRA). The system further refines the search results from Phase A by employing an ensemble retriever that combines sparse and dense retrievers to identify the most relevant chunks. Finally, the system feeds the question and the most relevant chunks into the base model to generate the answer using appropriate prompts. In Phase B, the answer generation pipeline is similar to Phase A+, with the main difference being that we directly build indexes for the questions and their relevant snippets, treating snippets as the basic retrieval unit. We conducted detailed ablation studies and analyses on the model types and retrieval techniques, which indicate that PEFT and RAG can significantly improve performance on biomedical Question Answering (QA) tasks.

Keywords
Biomedical Question-Answering, Retrieval-Augmented Generation, Large Language Model, Parameter-Efficient Fine-Tuning, BioASQ

1. Introduction
In recent years, Large Language Models (LLMs) have undergone significant development and gained widespread adoption [1, 2, 3, 4, 5], particularly in the biomedical domain, including medical informatics [6], medical imaging [7], and bioinformatics [8, 9, 10]. To address the requirements of healthcare professionals and medical education, researchers have begun investigating the use of LLMs for medical knowledge queries. Their objective is to integrate intricate medical knowledge into LLMs, thereby providing a knowledge query system with more accurate and comprehensive medical information. Techniques like In-context Learning [11] and Retrieval Augmented Generation (RAG) [12] have been proposed, which introduce a novel paradigm for tackling knowledge queries and responses within specialized domains. In the field of medical information retrieval, a critical issue lies in constructing a robust information retrieval system to effectively manage a massive corpus of medical literature [13]. BioASQ [14] presents a platform to address this challenge, encouraging researchers to develop better intelligent retrieval systems. This paper focuses on task 12b [15], particularly the information retrieval and question-answering tasks, and tackles Phase A, Phase A+, and Phase B. Phase A involves finding documents or snippets relevant to a biomedical question.
Phase B provides biomedical inquiries alongside relevant snippets (one or several sentences) and tasks participants with generating either the exact or ideal answers based on these snippets. Phase A+, a new phase this year, directly measures the performance of the system when only the biomedical question is provided. Note that in this work we focus only on the ideal answers and do not participate in the evaluation of the exact answers. Here we utilize a BM25 retriever to search documents in Phase A, and Parameter-Efficient Fine-Tuning (PEFT) on the BioASQ corpus [16] to generate high-quality answers in Phase B. In Phase A+, we propose a hierarchical Retrieval-Augmented Generation pipeline that combines the aforementioned BM25 retriever and PEFT for searching and generation. Accordingly, our system is called Corpus PEFT Searching (CPS). For convenience, we refer to the systems used in Phases A, B, and A+ as CPS-A, CPS-B, and CPS-A+, respectively. By presenting the results of our system and conducting detailed ablation studies, we conclude that PEFT and hierarchical RAG bring significant improvements in model performance.

2. Related Work
Since this paper can be divided into pipeline-related research and retrieval-related research, this section outlines the two corresponding parts of related work in Section 2.1 and Section 2.2. Specifically, we begin by reviewing work related to the main pipeline, which covers LLMs in medical informatics, parameter-efficient model fine-tuning, and retrieval augmented generation. After that, we introduce the structure of the retrieved text in RAG, i.e., the choice of retrieval unit, and review prior systems that differ in how the retrieved text is composed.

2.1. Construction of Pipeline: LLM, PEFT, and RAG
2.1.1. Large Language Models (LLMs) in Medical Informatics
In recent years, the advancement of Large Language Models (LLMs), exemplified by GPT [1], BERT [2] and Llama [3, 4], alongside the abundant public medical text data from platforms like PubMed (https://pubmed.ncbi.nlm.nih.gov), has encouraged researchers to explore diverse applications of LLMs in medical informatics. Agrawal et al. [11] reveal that sizable language models, even without explicit training for medical fields, exhibit the capacity to extract clinical information in few-shot settings. The latest iteration, Med-PaLM2 [6], fine-tuned from Flan-PaLM [17], attained an accuracy of 0.865 on the MedQA [18] benchmark for medical question answering. In addition, Venigalla et al. [19] further conducted full-parameter training of a model using data from PubMed, enhancing the model's capabilities in medical question-answering. Despite the apparent complexity of training models with a substantial number of parameters, Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) [43] have empowered researchers to efficiently fine-tune various models in both English and Chinese medical contexts [20, 21, 22, 23], including in our work.
2.1.2. Parameter-Efficient Fine-Tuning (PEFT)
Parameter-Efficient Fine-Tuning (PEFT) techniques encompass a series of fast, effective model fine-tuning methods with low GPU computing and memory demands. They are widely applied in research on Large Language Models, especially in domain-specific applications [22, 23]. Adapter tuning [24] introduces small neural network modules (adapters) into the Transformer model and uses a bottleneck architecture to compress and restore the original feature vectors, adapting the language model to new tasks. Prefix Tuning [25], proposed by X. L. Li and P. Liang, prepends a series of trainable prefixes to each Transformer layer to improve performance in specific domains. Meanwhile, by freezing the original weight matrices and approximating the parameter updates through low-rank decomposition, Low-Rank Adaptation (LoRA) [43] fine-tunes models with efficient memory and storage usage.

2.1.3. Retrieval Augmented Generation (RAG)
To supplement the knowledge required by large language models when handling knowledge-intensive tasks, a system can retrieve information from large corpora like Wikipedia to generate prompt information [12]. The retrieval process can be performed with sparse retrieval, using methods such as BM25 [26], or with dense retrieval [27], using vector similarity based on embedding models like Bge [28]. To capture text relationships and knowledge across multiple documents, recent work [29, 30, 31] treats multiple documents as a graph and conducts subgraph retrieval within the graph. There are also applications and frameworks, such as ChatPDF (https://www.chatpdf.com) and LangChain (https://www.langchain.com/), that parse large amounts of prompt information for question-answering systems by inputting and summarizing entire documents in a segmented fashion, thereby augmenting knowledge. In recent years, RAG has also gained much attention in the biomedical field. MedCPT [32] was proposed to generate better biomedical sentence representations for zero-shot semantic information retrieval. It consists of a pair of Transformer-based retriever and re-ranker pre-trained on PubMed user click logs by contrastive learning. Xiong et al. [33] proposed MIRAGE and MEDRAG to conduct systematic and large-scale experiments in the field of medical RAG, showing that RAG based on biomedical corpora can improve the performance of GPT-3.5 and Mixtral to the level of GPT-4.

2.2. Retrieval Units: Chunk and Snippet
2.2.1. Chunk: Generally Used Basic Unit of Retrieval
Recently, numerous studies have focused on tasks within the medical domain, enhancing the capacity of Large Language Models (LLMs) to tackle Question Answering (QA) tasks related to medical knowledge through the application of Retrieval Augmented Generation (RAG) on biomedical databases. In these studies, due to the substantial size of the databases and the absence of structured organization or sentence extraction methods tailored to Question Answering inquiries, entire texts or chunks are employed as the basic unit of retrieval. GeneGPT [34] obtains knowledge for LLMs in the genetics field by calling the web APIs provided by the National Center for Biotechnology Information (NCBI). When GeneGPT organizes this knowledge, it uses the Codex [35] model with a long context length (8k tokens) to handle structured API return results. Based on Chain-of-Thought prompting [36], a question that requires multiple API returns is decomposed into multiple sub-problems.
RAG pipelines that use a vectorized database, such as Chat-Orthopedist [37], split the complete documents in the database into chunks of the same length, which is suitable both for retrieval and for input to the LLM. In addition, for RAG tasks in non-medical fields, approaches that divide documents into chunks are also very typical. Usually, vector databases such as chromaDB (https://www.trychroma.com/) and Pinecone (https://www.pinecone.io/) are employed as the basis for construction.

2.2.2. Snippet: Basic Unit of Retrieval in BioASQ
Besides the exemplary medical knowledge Question Answering LLM pipelines that stand out in the research domain, we also examined other systems featured in the BioASQ task 11b publications (CLEF-WN 2023 [38]). These systems primarily utilize the provided snippets directly, with some slight variations in specific applications. UR-gpt [39], which performs very well in the rankings, inserts snippets directly into prompts for in-context learning. The Sentence-based Ranking [40] system's method is close to the previous one, but it uses only the top 5 most relevant sentences due to the input length limit of its model. IISR [41] first selects the top 5 most relevant snippets and uses ChatGPT to truncate or summarize them to a fixed length; the context is input as the ChatGPT conversation history instead of being fused together with the question.

3. Methodology
This section details the pipelines we employed in each phase. Our system is called Corpus PEFT Searching (CPS), as our main contribution is to integrate Parameter-Efficient Fine-Tuning (PEFT) on the PMC corpus with a hierarchical retrieval-based searching method in the BioASQ 12b task. Since the inputs and targeted outputs for submission differ in the three phases, we built three pipelines named CPS-A, CPS-B, and CPS-A+ for the Phase A, B, and A+ submissions respectively. Table 1 shows the systems used in the different phases and the sections and figures that describe them. In Phase A (Section 3.1), we leveraged the BM25 retriever to search relevant documents and return the document list. Then in Phase B (Section 3.2), we used the BioASQ training set as the corpus to fine-tune a Large Language Model (LLM) with a Parameter-Efficient Fine-Tuning (PEFT) method to generate more reasonable answers. Finally, in Phase A+ (Section 3.3), we proposed a hierarchical Retrieval-Augmented Generation (RAG) method combining the methods mentioned above to construct a general biomedical Q&A system, which could also serve as a chatbot for other biomedical questions.

Table 1: The systems used in different phases with the corresponding sections and figures that describe them.
System    Section       Figure     Phase
CPS-A     Section 3.1   Figure 1   Phase A
CPS-B     Section 3.2   Figure 2   Phase B
CPS-A+    Section 3.3   Figure 3   Phase A+

3.1. Phase A
Figure 1: Overview of system CPS-A (Corpus PEFT Searching - phase A), the retrieval pipeline for BioASQ task 12b phase A.

As illustrated in Figure 1, in Phase A we first downloaded all the abstracts available from PubMed Central (PMC, https://www.ncbi.nlm.nih.gov/pmc/) and employed the Pyserini package [42] to build BM25 indexes of these abstracts, treating each abstract as an individual document. Subsequently, we applied the BM25 retriever to search related documents using the query questions as keywords. After filtering by an appropriate similarity score threshold, the final list of relevant articles is obtained.
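The following sketch illustrates how such a Pyserini-based BM25 indexing and retrieval step can be set up; the corpus and index paths, score threshold, and top-k cutoff are illustrative assumptions rather than the exact values used in CPS-A.

# Build the Lucene index once (each abstract stored as one JSON document with a "contents" field):
#   python -m pyserini.index.lucene --collection JsonCollection \
#       --input pmc_abstracts_jsonl/ --index indexes/pmc_abstracts \
#       --generator DefaultLuceneDocumentGenerator --threads 8 --storeRaw

from pyserini.search.lucene import LuceneSearcher

def retrieve_documents(question, index_dir="indexes/pmc_abstracts",
                       k=50, score_threshold=15.0, max_docs=10):
    """Return up to max_docs (docid, score) pairs whose BM25 score passes the threshold."""
    searcher = LuceneSearcher(index_dir)      # BM25 is Pyserini's default ranking model
    hits = searcher.search(question, k=k)     # the question body is used directly as keywords
    filtered = [(h.docid, h.score) for h in hits if h.score >= score_threshold]
    return filtered[:max_docs]

The returned document identifiers form the relevant document list submitted in Phase A.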
3.2. Phase B
Our work in Phase B can be divided into two parts, Pre-processing and Retrieval & Generation, as shown in Figure 2. In the Pre-processing stage, we built BM25 indexes of the golden-enriched relevant snippets using Pyserini [42] for later retrieval. To help generate more accurate answers, we used a Parameter-Efficient Fine-Tuning (PEFT) method, LoRA [43], to fine-tune the Llama2-7B-chat model [4] on the BioASQ corpus, i.e., the question-answering pairs from the BioASQ task 12b training set (consisting of 5049 Q&A pairs). Due to limited computing resources, we were unable to directly fine-tune the whole Large Language Model on the corpus. Low-Rank Adaptation (LoRA) [43] helps reduce the fine-tuning time and resources significantly while keeping the model performance satisfactory. The work in this stage was done before the test sets were released and was not repeated later.

Figure 2: Overview of system CPS-B (Corpus PEFT Searching - phase B), the proposed LLM pipeline with parameter-efficient fine-tuning (PEFT) for BioASQ task 12b phase B.

The Retrieval & Generation stage is the process of generating answers for the query questions in Phase B. Our system leverages an ensemble retriever combining sparse (BM25) and dense (vector similarity) retrievers to search for the most relevant snippets and provides them as references to the LLM. Bge (https://huggingface.co/BAAI/bge-large-en) was used to embed the chunks into vectors for similarity searching. We constructed this retriever because the performance of directly inputting all snippets into the model is not good enough, which we discuss in Section 4.3.3. Given appropriate prompts with the retrieved snippets and the query question, the fine-tuned model can generate a response for the query question to be submitted in Phase B. The prompt template is shown below; we modified it from Alpaca (https://github.com/tatsu-lab/stanford_alpaca) by adding a role definition at the beginning.

You are an expert in the field of biomedical science. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{question}
### Input:
{context}
### Response:

In this template, the position of 'context' is replaced by the relevant snippets found in the retrieval procedure, while 'question' is replaced by the query question body.
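As a rough illustration of the Pre-processing stage, the sketch below fine-tunes Llama2-7B-chat with LoRA on the BioASQ QA pairs. The JSON field names, LoRA hyperparameters, target modules, and training arguments are illustrative assumptions (the paper only states that default settings and fp16 were used), not the exact configuration of CPS-B.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-chat-hf"
TEMPLATE = ("You are an expert in the field of biomedical science. Below is an instruction that "
            "describes a task, paired with an input that provides further context. Write a response "
            "that appropriately completes the request.\n\n### Instruction:\n{q}\n\n"
            "### Input:\n{c}\n\n### Response:\n{a}")

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16)
# LoRA: freeze the base weights and learn low-rank update matrices on the attention projections.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

def encode(example):
    # Hypothetical field names for the BioASQ training JSON.
    text = TEMPLATE.format(q=example["question"], c=example["snippets"], a=example["ideal_answer"])
    enc = tokenizer(text, truncation=True, max_length=1024, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()   # standard causal-LM objective
    return enc

train_set = load_dataset("json", data_files="bioasq_12b_training.json")["train"].map(encode)

Trainer(model=model,
        args=TrainingArguments(output_dir="cps_lora", per_device_train_batch_size=1,
                               gradient_accumulation_steps=8, num_train_epochs=3,
                               learning_rate=2e-4, fp16=True, logging_steps=50),
        train_dataset=train_set).train()
model.save_pretrained("cps_lora")             # only the small LoRA adapter is stored

At inference time, the adapter is loaded on top of the base model and generation uses the same template, with the part after '### Response:' left empty.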
3.3. Phase A+
Phase A+ is more challenging than Phase B since no golden-enriched snippets are provided. Therefore, our system needs to first search for the relevant documents and snippets in the whole PMC database, and then generate answers. Here, we implement a hierarchical Retrieval Augmented Generation (RAG) pipeline combining the insights from Phases A and B (shown in Figure 3).

Figure 3: Overview of system CPS-A+ (Corpus PEFT Searching - phase A+), the proposed LLM pipeline with fine-tuning and hierarchical retrieval augmented generation for BioASQ task 12b phase A+.

Similarly to Phase B, the framework consists of two stages: Pre-processing and Retrieval & Generation. The fine-tuned model in the Pre-processing stage is identical to that in Phase B, while the indexes are built from the whole set of PMC abstracts rather than the golden-enriched snippets. In the Retrieval & Generation stage, we propose a hierarchical strategy to search for question-relevant texts. Specifically, considering that sparse retrieval is quicker than dense retrieval, we still employ the BM25 sparse retrieval indexed on the PMC abstracts as the first-level retrieval. This process selects the most relevant abstracts as a preliminary information screening. These documents are then segmented into multiple chunks by the TextSplitter from the LangChain package (langchain_text_splitters.RecursiveCharacterTextSplitter) for the next search. Subsequently, an ensemble retriever, identical to the retriever in Phase B, is employed as a more precise second-level retrieval and identifies the chunks most relevant to the question. Finally, these most relevant chunks and the question are combined via the prompt template described in Section 3.2 to form the model input.
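A condensed sketch of this two-level retrieval is given below. It assumes the Phase A index was built with --storeRaw and that each abstract is stored in Pyserini's "contents" field; the chunk size, k values, and ensemble weights are illustrative assumptions rather than the settings used in the actual submission.

import json
from pyserini.search.lucene import LuceneSearcher
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.retrievers import BM25Retriever
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.retrievers import EnsembleRetriever

def hierarchical_retrieve(question, index_dir="indexes/pmc_abstracts", n_docs=20, n_chunks=4):
    # Level 1: coarse BM25 screening over PMC abstracts (same index as Phase A).
    searcher = LuceneSearcher(index_dir)
    abstracts = [json.loads(searcher.doc(h.docid).raw())["contents"]
                 for h in searcher.search(question, k=n_docs)]

    # Split the selected abstracts into fixed-size chunks.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.create_documents(abstracts)

    # Level 2: ensemble of a sparse (BM25) and a dense (BGE embedding) retriever over the chunks.
    sparse = BM25Retriever.from_documents(chunks)
    sparse.k = n_chunks
    dense = FAISS.from_documents(chunks, HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en")
                                ).as_retriever(search_kwargs={"k": n_chunks})
    ensemble = EnsembleRetriever(retrievers=[sparse, dense], weights=[0.5, 0.5])
    return ensemble.get_relevant_documents(question)[:n_chunks]

The returned chunks are then inserted into the '### Input:' slot of the prompt template from Section 3.2.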
4. Results and Analyses
4.1. Experiment Settings
4.1.1. Datasets
For the official submissions this year, we utilize the training set of BioASQ Task 12b (consisting of 5049 question-answer pairs) as the LoRA fine-tuning dataset (mentioned in Section 3.2 and Section 3.3) for the CPS-B and CPS-A+ submissions. In the ablation study experiments in Section 4.3, we employ the training set of BioASQ Task 11b (consisting of 4719 question-answer pairs) as the LoRA fine-tuning dataset and the test set of BioASQ Task 11b batch 1 (consisting of 90 question-answer pairs) as the test set for the evaluations.

4.1.2. Model Deployment
Considering factors such as deployment complexity, open-source availability, and parameter scale, we chose Llama2-7B-chat as the base model in this work. Since this study does not focus on the model itself, it adheres to the default settings for model parameters and fine-tuning parameters. In all our experiments and submissions, we use fp16 precision for both model fine-tuning and inference. We performed our experiments on a Linux server with two NVIDIA GeForce RTX 3090 GPUs.

4.2. BioASQ Task 12b Official Evaluations
In this section, we present the official evaluation results of our system in the four batches of BioASQ Task 12b. For comparison, we include the best-performing system in each batch. As there are multiple metrics, we selected the system that ranked in the top 5 on all metrics in each batch. If no system met this criterion, we expanded the selection to systems ranked in the top 10. The selected system is referred to as the "Top Competitor" in the tables in Section 4.2.

4.2.1. Phase A
We utilized the BM25 retrieval-based CPS-A system to retrieve relevant documents in Phase A (introduced in Section 3.1). Table 2 shows the document retrieval results of our submission. To select the relevant document list, we filter the retrieval scores by an appropriate threshold optimized on the training set and keep the top 10 documents if more than 10 documents pass the filter.

Table 2: The evaluation results of our submission in BioASQ Task 12b Phase A - document retrieval using the system CPS-A.
Batch   System           Mean Precision   Recall   F-Measure   MAP      GMAP
1       CPS-A            0.1119           0.0842   0.076       0.0661   0.0001
1       Top Competitor   0.1294           0.3369   0.1728      0.2006   0.0019
2       CPS-A            0.0717           0.1201   0.073       0.0845   0.0001
2       Top Competitor   0.1085           0.3580   0.1524      0.2041   0.0022
3       CPS-A            0.0576           0.1777   0.0778      0.116    0.0002
3       Top Competitor   0.0980           0.3849   0.1438      0.2487   0.0027
4       CPS-A            0.0606           0.2304   0.0854      0.1352   0.0003
4       Top Competitor   0.1239           0.5529   0.1833      0.3773   0.0142

4.2.2. Phase B
In this section, we report the 'ideal answer' submission results in BioASQ task 12b Phase B using CPS-B (Table 3). Since there is randomness in answer generation, we submitted two or three results per batch in the challenge and record the best one in the table.

Table 3: The evaluation results of our submission in BioASQ Task 12b Phase B - ideal answer using the system CPS-B.
Batch   System           R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
1       CPS-B            0.2873      0.209      0.2936        0.2017
1       Top Competitor   0.3288      0.3179     0.3353        0.3168
2       CPS-B            0.2413      0.1709     0.256         0.1694
2       Top Competitor   0.4333      0.4008     0.4462        0.4035
3       CPS-B            0.3266      0.2544     0.3124        0.2394
3       Top Competitor   0.5283      0.3806     0.5525        0.3643
4       CPS-B            0.2982      0.2147     0.3125        0.216
4       Top Competitor   0.4398      0.3697     0.4208        0.3458

4.2.3. Phase A+
We used CPS-A+ to complete the submission for Phase A+, and the 'ideal answer' results are shown in Table 4. We did not submit a result for batch 1 due to the time limit.

Table 4: The evaluation results of our submission in BioASQ Task 12b Phase A+ - ideal answer using the system CPS-A+.
Batch   System           R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
2       CPS-A+           0.1729      0.1235     0.1949        0.1272
2       Top Competitor   0.2187      0.1898     0.2292        0.1898
3       CPS-A+           0.2414      0.1955     0.2452        0.1946
3       Top Competitor   0.3138      0.1454     0.3334        0.1470
4       CPS-A+           0.2224      0.1376     0.2517        0.1495
4       Top Competitor   0.2783      0.2461     0.2795        0.2431

4.3. Ablation Study
4.3.1. On Generation Models
The capability of the answer generation model has a significant influence on the performance in Phase B. Here we compare the performance of different generation models when taking questions and relevant contexts as input (shown in Table 5). The experiments in the first three rows of Table 5 were carried out by ourselves. In particular, the Llama2-7B-chat + LoRA system is identical to the aforementioned system CPS-B, except that its dataset was changed as stated in Section 4.1.1. The Llama2-7B-chat system differs from CPS-B in that the fine-tuned model is replaced by the original Llama2-7B-chat model. The GPT-3.5-turbo system is based on the GPT-3.5-turbo API, where the input consists of the query question and the context obtained from the retrieval process of the CPS-B system; in this way, its input is consistent with that of CPS-B. Apart from these systems, we also select baselines and outstanding results from BioASQ task 11b Phase B (http://participants-area.bioasq.org/results/11b/phaseB/).
BioASQ Baseline FS is an official baseline obtained from BioGPT [44] and described in detail in [38]. This official baseline system uses the concatenation of the question body and the relevant snippets until the input length limit is exceeded. UR-gpt4 [39] and UR-gpt3.5 [40] are systems using GPT-4 and GPT-3.5 with the corresponding snippets. IISR [41] also used ChatGPT as its main model, but used a different snippet input method, as described in Section 2.2.2. We can observe that the GPT-4 model is quite powerful and obtains impressive results without fine-tuning (the highest R-2 Recall and the second highest R-SU4 Recall). Besides, our system achieved the highest scores on three metrics: R-2 F1-score, R-SU4 Recall, and R-SU4 F1-score. Therefore, fine-tuning a relatively small LLM can achieve remarkable results and beat a very powerful LLM in specific domains.

Table 5: Performance of different generation models on BioASQ Task 11b - Phase B Test batch 1.
Model                            R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
GPT-3.5-turbo                    0.2452      0.2999     0.2699        0.3258
Llama2-7B-chat                   0.1599      0.1911     0.1181        0.1415
Llama2-7B-chat + LoRA (CPS-B)    0.5609      0.5581     0.5798        0.5734
BioASQ Baseline FS [38]          0.3048      0.2493     0.3026        0.2443
UR-gpt4 [39]                     0.5630      0.2136     0.5521        0.1990
UR-gpt3.5 [40]                   0.5245      0.1762     0.5209        0.1663
IISR [41]                        0.4249      0.4037     0.4138        0.3930

4.3.2. On Retrieved Context
Beyond the generation model itself, the retrieved context can improve the accuracy of the generated answers. Here we evaluate the performance with and without context given to the models, as shown in Table 6. We can observe that no model performed very well without relevant context. Llama2-7B-chat + LoRA and BioASQ Baseline ZS do better since they were fine-tuned on biomedical corpora. However, when relevant context is given, all models improve significantly, by more than 100% on some metrics. Therefore, Retrieval-Augmented Generation is useful and promising when dealing with domain-specific Question-Answering tasks.

Besides, we explored the influence of the quality of the retrieved context on the performance from two aspects: Retrieval Unit and Retrieval Source.
• Retrieval Unit: The golden-enriched snippets provided in the Phase B test sets are of course the best retrieval unit for answer generation, since they are validated by biomedical experts. Therefore, we build a baseline treating chunks split from the golden-enriched documents as the retrieval units. The ensemble retriever searches for the most relevant snippets or chunks to provide references for the answer generation model (Llama2-7B-chat + LoRA). The results are illustrated in Table 7, which indicates that retrieving from the golden-enriched snippets brings better performance.
• Retrieval Source: We then experiment with adding noise to the retrieval process by changing the retrieval source, i.e., where the retrieved context comes from. We conducted experiments in three different settings: 'Test Set', 'Training Set', and 'Training + Test Set'. The 'Test Set' setting is identical to the setting in Phase B. The 'Training + Test Set' setting means the relevant snippets are searched from the snippets in both the training and test sets, which brings some noise into the given snippets. The 'Training Set' setting means the relevant snippets are searched only from the snippets in the training set.
Note that the snippets in the training set are irrelevant in most cases, so this setting introduces a large amount of noise. The results in Table 8 also support the conclusion that higher quality of the retrieved contexts leads to better performance.

Table 6: Comparison of different answer generation models with (w/) and without (w/o) retrieved context on the BioASQ Task 11b - Phase B Test batch 1 dataset.
Model                            Context   R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
Llama2-7B-chat                   w/o       0.0732      0.0951     0.0567        0.0713
Llama2-7B-chat                   w/        0.1599      0.1911     0.1181        0.1415
Llama2-7B-chat + LoRA            w/o       0.2836      0.2481     0.2177        0.1814
Llama2-7B-chat + LoRA (CPS-B)    w/        0.5609      0.5581     0.5798        0.5734
GPT-3.5-turbo                    w/o       0.0738      0.1089     0.093         0.1362
GPT-3.5-turbo                    w/        0.2452      0.2999     0.2699        0.3258
BioASQ Baseline ZS [38]          w/o       0.1727      0.0977     0.1936        0.1004
BioASQ Baseline FS [38]          w/        0.3048      0.2493     0.3026        0.2443

Table 7: Comparison between systems using different retrieval units on the BioASQ Task 11b - Phase B Test batch 1 dataset.
Retrieval Unit   R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
Snippet          0.5609      0.5581     0.5798        0.5734
Chunk            0.3616      0.3454     0.4042        0.3635

Table 8: Comparison between systems using different retrieval sources on the BioASQ Task 11b - Phase B Test batch 1 dataset.
Retrieval Unit   Retrieval Source      R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
Snippet          Test Set              0.5609      0.5581     0.5798        0.5734
Snippet          Training + Test Set   0.4616      0.4793     0.4785        0.4934
Snippet          Training Set          0.2095      0.1917     0.2502        0.2235

4.3.3. On Context Input Method
Keeping the generation model and the context the same, we further explored different context input methods. We constructed a baseline called 'Stuff Documents' by replacing the ensemble retriever of the system CPS-B with the 'create stuff documents chain' from the LangChain package (LangChain API: langchain.chains.combine_documents.stuff.create_stuff_documents_chain). This chain stuffs all the relevant snippets of the query question into the context of the model input. The results in Table 9 show that introducing the retrieval process in CPS-B is more effective than simply inputting all the relevant snippets as the context.

Table 9: Comparison between our retrieval-based CPS-B system and the 'Stuff Documents' baseline on the BioASQ Task 11b - Phase B Test batch 1 dataset.
Method            R-2 (Rec)   R-2 (F1)   R-SU4 (Rec)   R-SU4 (F1)
CPS-B             0.5609      0.5581     0.5798        0.5734
Stuff Documents   0.3661      0.3183     0.4996        0.3572
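For concreteness, the contrast examined in Section 4.3.3 can be sketched as follows. This is a rough illustration assuming the snippets are wrapped as LangChain Document objects and that llm is any LangChain-compatible wrapper around the fine-tuned model; it is not the exact submission code.

from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

PROMPT = PromptTemplate.from_template(
    "You are an expert in the field of biomedical science. Below is an instruction that describes a "
    "task, paired with an input that provides further context. Write a response that appropriately "
    "completes the request.\n\n### Instruction:\n{question}\n\n### Input:\n{context}\n\n### Response:\n")

def stuff_documents_answer(llm, question, snippets):
    # Baseline: every provided snippet is concatenated into the {context} slot.
    chain = create_stuff_documents_chain(llm, PROMPT)
    return chain.invoke({"question": question, "context": snippets})

def cps_b_answer(llm, ensemble_retriever, question):
    # CPS-B: only the snippets that the ensemble retriever ranks as most relevant are kept.
    top_snippets = ensemble_retriever.get_relevant_documents(question)
    chain = create_stuff_documents_chain(llm, PROMPT)
    return chain.invoke({"question": question, "context": top_snippets})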
5. Discussion
We constructed the Corpus PEFT Searching (CPS) system step by step, following the different phases of the BioASQ challenge. First, in Phase A the system accesses the PubMed Central database with a BM25 retriever. Second, in Phase B the system integrates the Llama2-7B-chat model fine-tuned by LoRA with an ensemble retriever. Finally, in Phase A+, we use an optimized two-stage hierarchical retrieval structure to connect the corpus and the answer generation LLM: the first stage employs BM25 for a rough retrieval of documents, while the second stage combines BM25 with vector similarity for a fine-grained retrieval of document chunks. The CPS system can search for documents related to query questions and generate appropriate answers. It achieved a middle to slightly higher position in the BioASQ 12b leaderboard rankings, probably because the scale and the context window length of our model are relatively small. Nevertheless, we made substantial efforts to improve its performance by utilizing different techniques, including sparse and dense ensemble retrievers, Parameter-Efficient Fine-Tuning (PEFT), and hierarchical Retrieval-Augmented Generation (RAG).

The experiments in Section 4.3 present our exploration of model fine-tuning and retrieval techniques. We conclude that the techniques we employed can significantly enhance the model performance without altering its architecture, making full use of the generation capabilities of a 7B-parameter LLM. The experiments and analyses support the idea that, if the retrieved context contains all the information necessary to complete the task, a small-scale LLM fine-tuned with a parameter-efficient method is fully capable of generating ideal answers. Besides, improving the context input strategy, for example by using ensemble retrievers, also has the potential to enhance system effectiveness. The experimental results show that our method exceeds the best results in last year's ranking; however, there is still a gap between the system's performance in last year's and this year's challenges. In this work, we only applied the BM25 method to retrieve documents or snippets, since BM25 is still a strong retriever in the biomedical RAG domain, according to the comparisons in Xiong et al. [33]. Dense retrievers built for the biomedical area, such as MedCPT [32], have the potential to obtain better retrieval results. Future work will focus on in-depth research into more powerful retrieval techniques to improve the generation performance. In addition, research on LLMs is evolving quickly; we will keep pace with it and conduct more in-depth research on, and improvement of, the generation model itself.

Acknowledgments
This work was supported by the IdeaBooster Fund IDBF23ENG05 of The Chinese University of Hong Kong.

References
[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023). [2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). [4] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). [5] OpenAI, Introducing chatgpt, 2023. URL: https://openai.com/blog/chatgpt, OpenAI Blog. [6] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, et al., Towards expert-level medical question answering with large language models, arXiv preprint arXiv:2305.09617 (2023). [7] S. Wang, Z. Zhao, X. Ouyang, Q. Wang, D. Shen, Chatcad: Interactive computer-aided diagnosis on medical image using large language models, arXiv preprint arXiv:2302.07257 (2023). [8] P. Cramer, Alphafold2 and the future of structural biology, Nature structural & molecular biology 28 (2021) 704–705. [9] J. Chen, Z. Hu, S. Sun, Q. Tan, Y. Wang, Q. Yu, L. Zong, L. Hong, J. Xiao, T. Shen, et al., Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions, arXiv preprint arXiv:2204.00300 (2022). [10] Q. Yu, Z. Dong, X. Fan, L. Zong, Y. Li, Hmd-amp: Protein language-powered hierarchical multi-label deep forest for annotating antimicrobial peptides, arXiv preprint arXiv:2111.06023 (2021). [11] M. Agrawal, S.
Hegselmann, H. Lang, Y. Kim, D. Sontag, Large language models are few-shot clinical information extractors, arXiv preprint arXiv:2205.12689 (2022). [12] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474. [13] I. Klerings, A. S. Weinhandl, K. J. Thaler, Information overload in healthcare: too much of a good thing?, Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen 109 (2015) 285–290. [14] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024. [15] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024. [16] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170. [17] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, Journal of Machine Learning Research 25 (2024) 1–53. [18] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, P. Szolovits, What disease does this patient have? a large-scale open domain question answering dataset from medical exams, Applied Sciences 11 (2021) 6421. [19] A. Venigalla, J. Frankle, M. Carbin, Biomedlm: a domain-specific large language model for biomedical text, MosaicML. Accessed: Dec 23 (2022) 2. [20] H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, L. Huang, Q. Wang, D. Shen, Doctorglm: Fine-tuning your chinese doctor is not a herculean task, arXiv preprint arXiv:2304.01097 (2023). [21] Y. Tan, M. Li, Z. Huang, H. Yu, G. Fan, Medchatzh: a better medical adviser learns from better instructions, arXiv preprint arXiv:2309.01114 (2023). [22] T. Han, L. C. Adams, J.-M. Papaioannou, P. Grundmann, T. Oberhauser, A. Löser, D. Truhn, K. K. Bressem, Medalpaca–an open-source collection of medical conversational ai models and training data, arXiv preprint arXiv:2304.08247 (2023). [23] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo: Tuning llama model with chinese medical knowledge, arXiv preprint arXiv:2304.06975 (2023). [24] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-efficient transfer learning for nlp, in: International conference on machine learning, PMLR, 2019, pp. 2790–2799. [25] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, arXiv preprint arXiv:2101.00190 (2021). [26] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: Bm25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389. [27] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. 
Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, arXiv preprint arXiv:2004.04906 (2020). [28] P. Zhang, S. Xiao, Z. Liu, Z. Dou, J.-Y. Nie, Retrieve anything to augment large language models, arXiv preprint arXiv:2310.07554 (2023). [29] M. Yasunaga, J. Leskovec, P. Liang, Linkbert: Pretraining language models with document links, arXiv preprint arXiv:2203.15827 (2022). [30] M. Yasunaga, A. Bosselut, H. Ren, X. Zhang, C. D. Manning, P. S. Liang, J. Leskovec, Deep bidirectional language-knowledge graph pretraining, Advances in Neural Information Processing Systems 35 (2022) 37309–37323. [31] X. Zhang, A. Bosselut, M. Yasunaga, H. Ren, P. Liang, C. D. Manning, J. Leskovec, Greaselm: Graph reasoning enhanced language models for question answering, arXiv preprint arXiv:2201.08860 (2022). [32] Q. Jin, W. Kim, Q. Chen, D. C. Comeau, L. Yeganova, W. J. Wilbur, Z. Lu, Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval, Bioinformatics 39 (2023) btad651. [33] G. Xiong, Q. Jin, Z. Lu, A. Zhang, Benchmarking retrieval-augmented generation for medicine, arXiv preprint arXiv:2402.13178 (2024). [34] Q. Jin, Y. Yang, Q. Chen, Z. Lu, Genegpt: Augmenting large language models with domain tools for improved access to biomedical information, Bioinformatics 40 (2024) btae075. [35] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021). [36] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in neural information processing systems 35 (2022) 24824–24837. [37] W. Shi, Y. Zhuang, Y. Zhu, H. Iwinski, M. Wattenbarger, M. D. Wang, Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-making, in: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2023, pp. 1–10. [38] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima López, E. Farré-Maduell, L. Gasco, M. Krallinger, G. Paliouras, Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2023, pp. 227–250. [39] S. Ateia, U. Kruschwitz, Is chatgpt a biomedical expert? Exploring the zero-shot performance of current GPT models in biomedical tasks (2023). [40] A. Aksenova, T. Asamov, P. Ivanov, S. Boytcheva, Improving biomedical question answering with sentence-based ranking at bioasq-11b, in: Conference and Labs of the Evaluation Forum, 2023. URL: https://api.semanticscholar.org/CorpusID:264441330. [41] C.-Y. Hsueh, Y. Zhang, Y.-W. Lu, J.-C. Han, W. Meesawad, R. T.-H. Tsai, Ncu-iisr: Prompt engineering on gpt-4 to stove biological problems in bioasq 11b phase b, in: 11th BioASQ Workshop at the 14th Conference and Labs of the Evaluation Forum (CLEF), 2023. [42] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 2356–2362. [43] E. J. Hu, P.
Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2021. [44] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, T.-Y. Liu, Biogpt: generative pre-trained transformer for biomedical text generation and mining, Briefings in bioinformatics 23 (2022) bbac409.