NCU-IISR: Enhancing Biomedical Question Answering with GPT-4 and Retrieval Augmented Generation in BioASQ 12b Phase B

Bing-Chen Chih (1), Jen-Chieh Han (1) and Richard Tzong-Han Tsai (1,2,*)

(1) Department of Computer Science and Information Engineering, National Central University, Taiwan
(2) Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan
(*) Corresponding author.
charlie963258@gmail.com (B. Chih); joyhan@cc.ncu.edu.tw (J. Han); thtsai@csie.ncu.edu.tw (R. T. Tsai)

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this paper, we introduce our system and submissions for BioASQ 12b Phase B [1], highlighting a significant improvement achieved with GPT-4 and the integration of Retrieval Augmented Generation (RAG) techniques. We describe our prompt engineering methods and the experimental procedures we followed. Because GPT-4 has proven effective at answer generation and has demonstrated strong capability in the biomedical domain, our system uses GPT-4 to address biomedical question answering (QA). Leveraging OpenAI's ChatCompletions API, we refined the prompt engineering approaches we developed previously [2] for BioASQ 11b Phase B. This year, the addition of RAG techniques significantly improved the information retrieval capabilities of our system. Consequently, our latest submission employed the prompts and techniques that our experiments showed to be most effective, achieving strong performance across multiple metrics in the fourth batch.

Keywords
Biomedical Question Answering, Large Language Models (LLMs), Generative Pre-trained Transformer, Retrieval Augmented Generation

1. Introduction

BioASQ [3] has been at the forefront of advancing biomedical semantic indexing and question answering through its annual challenges since 2013. The 12th edition of BioASQ, specifically Task 12b Phase B [1], tasks participants with generating exact or ideal answers to biomedical questions using provided text snippets. This year's training dataset comprises 5,046 questions, which include the previous year's test set annotated with gold answers, along with 340 new test questions for evaluation. These test questions are organized into four batches of 85 questions each, meticulously crafted by a team of biomedical experts.

The questions in Task 12b Phase B are categorized into four types: yes/no, factoid, list, and summary. Among these, the yes/no, factoid, and list questions require exact answers, while all four types require an ideal answer. Participants can submit up to five results per batch, encouraging continuous optimization of their models and techniques. By structuring these rigorous challenges, BioASQ aims to drive innovation and enhance the capabilities of information retrieval systems in the biomedical domain.

Table 1 illustrates examples of the four categories in the BioASQ dataset. Each instance contains a question along with several snippets, and the answers are divided into an "ideal answer" and an "exact answer." Notably, the "summary" category does not require an "exact answer."

Last year, we leveraged the comprehension capabilities of GPT-4 combined with prompt engineering techniques and achieved strong results. This year, we continue to utilize GPT-4's comprehension abilities while incorporating Retrieval-Augmented Generation (RAG) techniques. By harnessing RAG's retrieval capabilities, we enhance the model's domain knowledge, thereby improving output performance. Additionally, we conducted a deeper analysis of the dataset and refined our answer generation approach.
Furthermore, we continued to apply prompt engineering techniques, ensuring a robust and effective approach to improving model responses.

Table 1. Examples across the four categories in the BioASQ dataset

Yes/No
Question: Can modulation of KCNQ1 splicing prevent arrhythmias?
Exact Answer: yes
Ideal Answer: Amiloride reduces arrhythmogenicity through the modulation of KCNQ1 splicing. Therefore, the modulation of KCNQ1 splicing may help prevent arrhythmias.

List
Question: Which drugs are included in the AZD7442?
Exact Answer: [tixagevimab, cilgavimab]
Ideal Answer: AZD7442 is a combination of two long-acting monoclonal antibodies tixagevimab and cilgavimab. It has been authorized for the prevention and treatment of coronavirus disease 2019 (COVID-19).

Factoid
Question: Olokizumab is tested for which disease?
Exact Answer: [rheumatoid arthritis]
Ideal Answer: Olokizumab, a monoclonal antibody against interleukin 6, improves outcomes of rheumatoid arthritis.

Summary
Question: What is the definition of dermatillomania?
Ideal Answer: Dermatillomania is a condition that leads to repetitive picking of their skin ending up in skin and soft tissue damage. It is a chronic, recurrent, and treatment resistant neuropsychiatric disorder with an underestimated prevalence that has a concerning negative impact on an individual's health and quality of life.

2. Related Work

The biomedical domain is characterized by extensive specialized knowledge and complex terminology, making the process of acquiring and applying this information both intricate and time-consuming. Traditional methods often involve reading a substantial number of academic papers, which requires not only significant professional expertise but also considerable effort and time. This approach is inefficient and fails to quickly meet the needs of both professionals and the general public.

Natural Language Processing (NLP) based Question-Answering (QA) systems provide a promising solution to these challenges. By leveraging advanced language models, these systems can interpret, retrieve, and generate responses from medical texts, significantly enhancing the efficiency of QA tasks. Consequently, QA systems streamline access to biomedical information, making it faster and more efficient for both experts and the general public. With the continuous advancement of deep learning technologies, QA models based on these techniques are progressively bridging the gap between complex biomedical data and practical usability, facilitating broader knowledge dissemination and application.

Prompt Engineering

Prompt engineering has emerged as a critical technique in the field of natural language processing (NLP) and machine learning, particularly for the utilization of large language models such as GPT-3 and GPT-4 [4]. The technique involves crafting specific prompts or input queries that guide the language model to produce the desired outputs. Various studies have highlighted the effectiveness of prompt engineering in improving the performance of language models across different tasks. For instance, Brown et al. [5] demonstrated that carefully designed prompts significantly increased the accuracy of few-shot learning in GPT-3, enabling the model to perform complex tasks with minimal examples.
This approach has been widely adopted in various applications, including question answering, text summarization, and language translation.

Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is a technique that combines retrieval-based methods with generative models to enhance the relevance and accuracy of generated text. Initially introduced by Lewis et al. [6], RAG has demonstrated significant improvements in open-domain question answering by retrieving relevant documents and using them to inform the generation process.

The RAG framework typically involves two main components: the retriever and the generator. The retriever is responsible for fetching relevant documents or snippets from a large corpus based on the input query. This is usually achieved with a dense passage retrieval (DPR) model, which encodes both the query and the documents into dense vectors and retrieves the most similar documents. The generator then takes these retrieved documents, concatenates them with the query, and generates a response using a generative model such as GPT-4. This combination allows the generative model to produce more accurate and contextually relevant answers by leveraging the additional context provided by the retrieved documents.

In the biomedical domain, RAG has been particularly beneficial due to the complexity and specificity of the information. Systems utilizing RAG can retrieve pertinent biomedical literature, thereby improving the contextual relevance and accuracy of the generated answers. This approach has shown promising results in challenges such as BioASQ.

Our work leverages RAG to enhance our GPT-4-based system for the BioASQ 12b Phase B challenge. By integrating RAG, we aim to improve the retrieval and utilization of relevant biomedical documents, ensuring that generated answers are well supported by accurate and relevant information. This integration enhances the generative capabilities of GPT-4, providing more reliable answers in the biomedical context.

3. Method

3.1. Dataset

The BioASQ Task 12b Phase B dataset [7] provides 5,049 training samples, comprising 1,210 summary, 1,515 factoid, 967 list, and 1,357 yes/no questions. Each sample includes multiple snippets along with their source documents. Last year, due to the token limit of OpenAI's API, we summarized the snippets and kept only the top five. This year, we abandoned this practice and found that performance did not drop, while we gained access to more comprehensive information.

3.2. Prompting

Snippets: For each question, we incorporate all available snippets as references. Although last year we observed that the top five snippets could cover most of the necessary information, we found that considering all snippets results in more accurate model outputs. Therefore, we list all snippets before the question in the prompt for the model to use as reference. For Retrieval-Augmented Generation (RAG), we compile all snippets into a database from which the model can retrieve relevant information effectively.

Questions: When using GPT-4 to generate both the ideal answer and the exact answer in a single call (direct generation), we retained most of the prompts from last year. We instructed the model to generate responses in JSON format and to keep the replies as concise as possible.
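As an illustration of this direct-generation setup, the sketch below shows one way such a call could look. It is a minimal example assuming the openai v1 Python client; the helper name answer_directly and the exact prompt wording are illustrative and not our verbatim prompts.

```python
# A minimal sketch of the direct-generation call, assuming the openai v1 Python client.
# The helper name and the prompt wording are illustrative, not our verbatim prompts.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_directly(question_body: str, snippets: list[str], question_type: str) -> dict:
    """Ask GPT-4 for the ideal and exact answers in a single call."""
    # All snippets are listed before the question, as described above.
    context = "\n".join(f"Snippet {i + 1}: {s}" for i, s in enumerate(snippets))
    prompt = (
        f"{context}\n\n"
        f"Question ({question_type}): {question_body}\n"
        'Reply in JSON with the keys "ideal_answer" and "exact_answer". '
        "Keep the reply as concise as possible."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # the setting that most often gave us the highest-quality outputs
        messages=[{"role": "user", "content": prompt}],
    )
    # The model is instructed to reply in JSON; parsing may still fail on malformed output.
    return json.loads(response.choices[0].message.content)
```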
In cases where the ideal answer and exact answer are generated separately, we first focus on generating a high-quality ideal answer. This is based on our observation that the entities in the exact answer typically appear in the ideal answer. Therefore, the model is first tasked with generating a well-crafted ideal answer, and it then generates the exact answer based on this ideal answer. In both stages, snippets are provided, and during exact answer generation, few-shot examples are included to ensure accuracy. Please refer to Table 2 for the prompts used when generating the answers separately; the other prompts are mostly the same as those we used last year.

In Batch-4, we observed that the ideal answers from our previous tests often contained complete segments from the snippets. Therefore, in this batch, we instructed the model to duplicate snippet fragments into the ideal answer during generation. This approach resulted in significantly improved outcomes.

Table 2. The prompts used for generating the ideal answer and exact answer separately

Ideal Answer:
"Reply to the answer clearly and easily in less than 3 sentences. You should read the chat history's content before answer the question. You can directly copy part of the above snippets as part of your answer. The question is: {QUESTION_BODY}"

Yes/No:
"Please answer me only yes or no. You should read the ideal_answer and snippets before answer the question."

List:
"Please answer me and follow the following rules: 1. Give me a list of precise key entities to answer the question, as clear and concise as possible. 2. You should read the ideal_answer and snippets before answer the question."

Factoid:
"Please answer me and follow the following rules: 1. Give me a list of precise key entities to answer the question, as clear and concise as possible. 2. The list should contain at least 1 and up to 5 entity names, ordered by decreasing confidence. 3. You should read the ideal_answer and snippets before answer the question."

3.3. Strategy

Our approach consists of two strategies for generating answers: direct generation of both the ideal and exact answers, and sequential generation of these answers. The latter idea draws from the chain-of-thought methodology [?]. Because we observed that generating exact answers directly often resulted in responses that were either imprecise or overly verbose, we provided few-shot examples to refine the accuracy of the answers. However, our experiments showed that separate generation does not consistently outperform direct generation; in most scenarios, direct generation proved sufficiently effective.

To enhance the model's use of the provided snippets or documents, we adopted the Retrieval-Augmented Generation (RAG) [6] technique. In our implementation, we used OpenAI's "text-embedding-ada-002" model to embed both the query and the snippets, which provided high-quality dense representations and improved retrieval accuracy. The retrieval process involved encoding the input query and the snippets with this embedding model, retrieving the top-k (k = 4) most relevant snippets by cosine similarity between the query and the snippets, and then directly concatenating these snippets with the query to form the prompt for GPT-4. Although there is a risk of retrieving incorrect fragments, our experiments indicate that this risk has minimal impact on the task's overall performance.
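The sketch below shows one way to implement this retrieval step under the setup described above. It assumes the openai v1 Python client and numpy; the function names (embed, retrieve_snippets, build_prompt) are illustrative rather than our actual code.

```python
# A minimal sketch of the snippet-retrieval step, assuming the openai v1 client and numpy.
# Function names are illustrative, not our actual code.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with text-embedding-ada-002 and return an (n, d) matrix."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])


def retrieve_snippets(question: str, snippets: list[str], k: int = 4) -> list[str]:
    """Return the top-k snippets ranked by cosine similarity to the question."""
    query_vec = embed([question])[0]
    snippet_vecs = embed(snippets)
    # Cosine similarity between the query and every snippet embedding.
    sims = snippet_vecs @ query_vec / (
        np.linalg.norm(snippet_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top_idx = np.argsort(sims)[::-1][:k]
    return [snippets[i] for i in top_idx]


def build_prompt(question: str, snippets: list[str]) -> str:
    """Concatenate the retrieved snippets with the question to form the GPT-4 prompt."""
    retrieved = retrieve_snippets(question, snippets)
    return "\n".join(retrieved) + f"\n\nQuestion: {question}"
```

Since each question comes with a limited number of snippets, an exact in-memory similarity search over their embeddings is sufficient for this setting; no approximate index is required.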
In our detailed analysis of the dataset, we observed that the gold ideal answers from previous years often contained segments identical to those in the snippets. As a result, in Batch-4, we modified our prompts to allow the model to appropriately duplicate snippet fragments into its answers. This adjustment led to improvements in the automated evaluations.

Furthermore, we experimented with adjusting the model's temperature setting. We found that setting the temperature to 0 often produced the highest-quality outputs, although this setting was not consistently stable and sometimes resulted in suboptimal performance in certain cases. Overall, our integration of RAG significantly enhanced the performance of our system by providing more contextually relevant and accurate information, thus improving the quality of the generated answers in the BioASQ 12b challenge.

3.4. Systems

We used different system configurations in different batches; the detailed configuration of each system is shown in Table 3. Please note that the configuration for Batch-1 was not recorded due to some errors.

Table 3. All submitted systems' settings. In the generation process, one approach directly outputs both the ideal and exact answers, while the other first generates the ideal answer and then derives the exact answer from it. Additionally, in certain configurations, parts of the snippet fragments are incorporated into the prompt to enhance the output.

Batch       System Name  Generation Strategy  RAG       Duplicate
Batch-2, 3  IISR-1       Split                Yes       No
            IISR-2       Direct               No        No
            IISR-3       Split                No        No
            IISR-4       Direct               Yes       No
            IISR-5       Split                Both RAG  No
Batch-4     IISR-1       Split                Yes       Yes
            IISR-2       Direct               No        Yes
            IISR-3       Split                No        Yes
            IISR-4       Direct               Yes       Yes
            IISR-5       Split                Both RAG  Yes

4. Results

Our results are presented separately for the Exact Answers (Table 4) and the Ideal Answers (Table 5). We observed significant improvements in Batch-4, where the prompt modifications and strategy adjustments were implemented; these changes led to noticeable performance gains. Although the manual scores have not yet been released, we achieved competitive rankings across various metrics in the automated evaluations.

Table 4. The Exact Answers test results on BioASQ. Acc and maF1 are reported for Yes/No questions; SAcc, LAcc, and MRR for Factoid; Precision, Recall, and F1 for List. We define FIN scores as the average of Accuracy in Yes/No, MRR in Factoid, and F-Measure in List.

Batch    System  Acc     maF1    SAcc    LAcc    MRR     Precision  Recall  F1
Batch-1  IISR-1  0.8800  0.8792  0.2381  0.2857  0.2540  0.6317     0.5426  0.5692
         IISR-2  0.9200  0.9188  0.2381  0.2381  0.2381  0.5813     0.5066  0.5244
         IISR-3  0.8800  0.8768  0.2381  0.2381  0.2381  0.5417     0.5300  0.5218
         IISR-4  0.9600  0.9589  0.1905  0.1905  0.1905  0.5449     0.5186  0.5208
         IISR-5  0.9200  0.9188  0.2381  0.2381  0.2381  0.5466     0.5309  0.5301
Batch-2  IISR-1  0.9231  0.9150  0.3684  0.3684  0.3684  0.6166     0.4915  0.5261
         IISR-2  0.9231  0.9150  0.5263  0.5263  0.5263  0.5784     0.5247  0.5456
         IISR-3  0.7692  0.7451  0.2632  0.2632  0.2632  0.4981     0.4443  0.4610
         IISR-4  0.9615  0.9585  0.4211  0.4211  0.4211  0.5436     0.4515  0.4840
         IISR-5  0.8846  0.8689  0.4211  0.4211  0.4211  0.5595     0.4947  0.5023
Batch-3  IISR-1  0.9167  0.9111  0.3077  0.3077  0.3077  0.3476     0.6010  0.4118
         IISR-2  0.9583  0.9564  0.4231  0.4231  0.4231  0.5536     0.5483  0.5475
         IISR-3  0.9583  0.9564  0.3077  0.3077  0.3077  0.3316     0.2670  0.2900
         IISR-4  1.0000  1.0000  0.4231  0.4231  0.4231  0.5452     0.5187  0.5247
         IISR-5  0.9583  0.9564  0.3846  0.3846  0.3846  0.4768     0.4664  0.4677
Batch-4  IISR-1  0.8889  0.8782  0.4211  0.4211  0.5132  0.6230     0.6097  0.6103
         IISR-2  0.9630  0.9571  0.5789  0.6316  0.5965  0.6226     0.5594  0.5828
         IISR-3  0.9630  0.9571  0.5263  0.6316  0.5702  0.5558     0.4791  0.5069
         IISR-4  0.9259  0.9112  0.5789  0.6316  0.5965  0.6735     0.6393  0.6460
         IISR-5  0.9259  0.9112  0.4211  0.5789  0.4649  0.5839     0.5207  0.5442
Our experiments demonstrated that employing Retrieval-Augmented Generation (RAG) to enhance domain knowledge comprehension significantly benefited our question answering system, proving to be a crucial component worth considering.

Table 5. The Ideal Answers test results on BioASQ.

Batch    System  R-2 (Rec)  R-2 (F1)  R-SU4 (Rec)  R-SU4 (F1)
Batch-1  IISR-1  0.2796     0.2339    0.2826       0.2292
         IISR-2  0.2625     0.1980    0.2660       0.1907
         IISR-3  0.3379     0.1366    0.3400       0.1270
         IISR-4  0.3675     0.1289    0.3774       0.1197
         IISR-5  0.3750     0.1442    0.3924       0.1369
Batch-2  IISR-1  0.3439     0.1801    0.3657       0.1725
         IISR-2  0.2365     0.1840    0.2457       0.1786
         IISR-3  0.2338     0.1421    0.2577       0.1448
         IISR-4  0.2840     0.2124    0.2962       0.2065
         IISR-5  0.3382     0.1818    0.3548       0.1731
Batch-3  IISR-1  0.4157     0.2480    0.4154       0.2343
         IISR-2  0.3321     0.2890    0.3284       0.2775
         IISR-3  0.3192     0.2225    0.3221       0.2139
         IISR-4  0.3818     0.3170    0.3753       0.3016
         IISR-5  0.4187     0.2493    0.4175       0.2340
Batch-4  IISR-1  0.4141     0.3451    0.4045       0.3250
         IISR-2  0.4120     0.3475    0.4008       0.3286
         IISR-3  0.4161     0.2998    0.4025       0.2794
         IISR-4  0.4398     0.3697    0.4208       0.3458
         IISR-5  0.4188     0.3505    0.4115       0.3310

5. Discussion and Conclusions

In this task, we conducted extensive testing of various techniques. Initially, we expected that focusing the model on generating only one type of answer might yield better performance. However, our experiments revealed that the model was equally effective when generating both the ideal and exact answers simultaneously. By utilizing the Retrieval-Augmented Generation (RAG) technique, we computed the similarity between queries and medical texts and extracted the text directly relevant to each question. This allowed the model to focus on high-quality data when formulating answers, and our experiments confirmed that this approach improved the model's performance. Future work could explore strategies for segmenting snippet documents, which may further enhance effectiveness. Additionally, continuous refinement of prompt engineering and RAG integration could lead to even more significant improvements in answer accuracy and relevance.

References

[1] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[2] C.-Y. Hsueh, Y. Zhang, Y.-W. L., J.-C. H., W. M., R. T.-H. T., NCU-IISR: Prompt engineering on GPT-4 to stove biological problems in BioASQ 11b Phase B, CEUR Workshop Proceedings 3497 (2023).
[3] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[4] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[6] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9459–9474. URL: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
[7] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170.