Using Pretrained Large Language Model with Prompt Engineering to Answer Biomedical Questions
DS@GT CLEF2024 BioASQ Task 12b and Synergy Task Working Note

Wenxin Zhou1, Thuy Hang Ngo2,*
1 Georgia Institute of Technology, North Ave NW, Atlanta, GA 30332, United States
* Corresponding author.
wzhou77@gatech.edu (W. Zhou); tngo62@gatech.edu (T. H. Ngo)
ORCID: 0009-0002-3325-3357 (W. Zhou)
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France

Abstract
Our team participated in the BioASQ 2024 Task 12b and Synergy tasks to build a system that answers biomedical questions by retrieving relevant articles and snippets from the PubMed database and generating exact and ideal answers. We propose a two-level information retrieval and question-answering system based on pre-trained large language models (LLMs), focused on LLM prompt engineering and response post-processing. We construct prompts with in-context few-shot examples and apply post-processing techniques such as resampling and malformed-response detection. We compare the performance of various pre-trained LLMs on this challenge, including Mixtral, OpenAI GPT, and Llama2. Our best-performing system achieved a 0.14 MAP score on document retrieval, a 0.05 MAP score on snippet retrieval, a 0.96 F1 score for yes/no questions, a 0.38 MRR score for factoid questions, and a 0.50 F1 score for list questions in Task 12b.

Keywords
large language model, prompt engineering, biomedical information retrieval, biomedical question answering

1. Introduction

BioASQ is a challenge for large-scale biomedical semantic indexing and question answering hosted by CLEF. The BioASQ 12b and Synergy tasks [1] are part of the CLEF 2024 BioASQ lab [2], which focuses on biomedical question answering and information retrieval. The challenge consists of four types of questions: yes/no, factoid, list, and summary. Participating systems need to perform two subtasks. The first subtask is to retrieve 10 relevant documents and snippets from the PubMed database that can answer the question. PubMed [3] is a search engine for biomedical literature that contains millions of abstracts of biomedical articles. Systems are evaluated on the relevance of the retrieved documents and snippets using the mean average precision (MAP) metric. The second subtask is to generate an exact answer and an ideal answer for each question. The exact answer is a short answer that directly answers the question: for yes/no questions, it is the single word "yes" or "no"; for list and factoid questions, it is a list of entities. The ideal answer is a long answer that provides more context and detail. Systems are evaluated on the quality and accuracy of the generated answers; the evaluation metric is F1 score for yes/no questions, mean reciprocal rank (MRR) for factoid questions, and F1 score for list questions. The ideal answer is scored manually based on the readability, recall, precision, and repetition of the answers. An example of the input and output format is shown in Figure 1. The organizers provide the BioASQ-QA dataset [4], which contains around 4721 questions from past BioASQ challenges, of which 27% are yes/no questions, 29% factoid, 24% summary, and 20% list.

We build a system based on pre-trained large language models for document retrieval and question answering. Although some solutions from previous years used large language models, they only experimented with OpenAI GPT models and basic prompt engineering. In this year's challenge, we experiment with various well-known large language models and use prompt engineering and response post-processing techniques to improve the performance of the system.
Figure 1: BioASQ Data Format

At a high level, we use an LLM to extract keywords from the question and compose a PubMed query to retrieve documents from the PubMed database, then use sentence embeddings to find the relevant snippets within those documents. For question answering, we use the snippets as context and construct few-shot example prompts to guide the LLM to generate answers in the desired format. In this paper, we discuss the modeling pipeline and prompt engineering strategies, as well as the experimental results with various LLMs on the Synergy and Task 12b tasks. Our implementation can be found on GitHub1.

2. Related Work

Large language models (LLMs) have recently shown great success in various natural language processing tasks, including text generation in the biomedical domain. Chen et al. [5] measured the performance of LLMs on the Biomedical Language Understanding and Reasoning Benchmark (BLURB), demonstrating the potential of LLMs for understanding and reasoning in the biomedical domain. Prompt engineering [6] is a technique that improves the performance of LLMs on domain-specific tasks. In-context few-shot examples in the prompt can help an LLM generate more accurate answers without the need to fine-tune the model. Some well-known LLMs include OpenAI GPT [7], Meta Llama2 [8], and Mistral AI's Mixtral [9] models.

2.1. Information Retrieval Approaches in BioASQ 2023

In the eleventh BioASQ challenge, two predominant methodologies were employed for the Information Retrieval (IR) task, typically segmented into a two-stage pipeline: retrieval and reranking. For the retrieval stage, the majority of the systems (7 out of 8) involved in Task 11B Phase A adopted a BM25 model for the initial document retrieval [10]. BM25 models rely on indexing the entire corpus of documents, which is computationally expensive since more than a million biomedical papers are added to the PubMed database each year [11]. The advantage of this method is the comprehensive list of documents that can be retrieved.

1 https://github.com/dsgt-kaggle-clef/bioasq-2024/

Figure 2: Information Retrieval System Design

In contrast, Ateia and Kruschwitz [12] utilized LLMs for retrieval via zero-shot learning by mirroring the expert workflow used to curate the BioASQ QA dataset. This method does not require additional computation to index the document corpus, but it may not be as comprehensive as the BM25 method. The system did not achieve the highest IR outcomes in last year's competition.

Most participating systems generated language model embeddings for reranking articles. This process was handled by a dedicated reranking model or through cosine similarity measures [10]. One system employed zero-shot prompting with an LLM to rank the documents, but this approach was resource-intensive and limited by the maximum context length of the model.

2.2. Question Answering Approaches in BioASQ 2023

For the Question Answering (QA) or Phase B task, several systems exploited the capabilities of LLMs by prompting them with essential snippets of information.
A comparative analysis revealed that GPT-4 outperformed an ensemble of fine-tuned BERT models, indicating that GPT-4 is more effective at navigating the complexities of biomedical question answering [13].

3. Methodology

3.1. Information Retrieval

We propose a two-stage IR system for the BioASQ task (Figure 2). The first stage retrieves a set of candidate documents from the PubMed database using a PubMed search query [14]. The second stage ranks the documents by cosine similarity of sentence embeddings to find the most relevant sentences.

3.1.1. Query Constructor

The query constructor creates the query for PubMed search [3]. To match the PubMed version defined by the organizers, we set the maxdate parameter of the esearch API to the date defined by the specific batch of the Synergy task. Task 12b requires the PubMed 2024 baseline, so we set the maxdate parameter to 2024-01-01 as an approximation. The query constructor uses two approaches.

Approach 1: Keyword extraction. We use LLMs or language models fine-tuned for biomedical terminology to extract keywords (such as biomedical entities) from the question, then concatenate those keywords with "AND" to form a PubMed query. For LLMs, we send a few-shot example prompt, shown in Table 1, to generate the keywords. For the biomedical language model, we use en_ner_bc5cdr_md, a spaCy biomedical named entity recognition (NER) model trained on the BC5CDR corpus [15], to extract keywords from the question sentence.

Table 1
Query Constructor Prompt, Approach 1

Q: What is the mode of action of Molnupiravir?
Keywords: Molnupiravir, action
###
Q: Is dapagliflozin effective for COVID-19?
Keywords: dapagliflozin, COVID-19
###
Q: {question body}
Keywords:

Approach 2: Direct query generation. We generate a query directly from the question using the large language model, inspired by Ateia and Kruschwitz [12]. The prompt template is composed of an instruction and three examples, as shown in Table 2.

Table 2
Query Constructor Prompt, Approach 2

Given a question, expand into a search query for PubMed by incorporating synonyms and additional terms that would yield relevant search results from PubMed to the provided question while not being too restrictive. Assume that phrases are not stemmed; therefore, generate useful variations. Return only the query that can directly be used without any explanation text.

Question: What is the mode of action of Molnupiravir?
Query: Molnupiravir AND ("mode of action" OR mechanism)
###
Question: Is dapagliflozin effective for COVID-19?
Query: dapagliflozin AND (COVID-19 OR SARS-CoV-2 OR coronavirus) AND (efficacy OR effective OR treatment)
###
Question: Name monoclonal antibody against SLAMF7.
Query: "SLAMF7" AND ("monoclonal antibody" OR "monoclonal antibodies")
###
Question: {body}
Query:

3.1.2. Reranker

The reranker ranks documents by calculating the relevance between each document and the question. We use a sentence transformer, specifically all-MiniLM-L6-v2 [16], to generate embeddings for the documents and the question. When a document is longer than the maximum input length of the sentence transformer, it is truncated to fit the input length. We calculate the cosine similarity between the embeddings of the question and the document and rank the documents by descending score.
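To make the two retrieval stages concrete, the sketch below combines the spaCy keyword-extraction variant of the query constructor, an E-utilities esearch call with the maxdate restriction, and the all-MiniLM-L6-v2 reranker. It is a minimal outline under stated assumptions rather than our exact implementation: the fetch_abstracts helper in the usage comment is a hypothetical placeholder for the abstract download, and the date and retmax values are illustrative.

```python
"""Minimal sketch of the two-stage retrieval described in Sections 3.1.1-3.1.2,
using the spaCy keyword-extraction variant of the query constructor.
Assumptions: the en_ner_bc5cdr_md model is installed, and abstract text for the
returned PMIDs is downloaded separately (e.g., with an efetch wrapper)."""
import requests
import spacy
from sentence_transformers import SentenceTransformer, util

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

ner = spacy.load("en_ner_bc5cdr_md")  # biomedical NER model (Approach 1)
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def build_query(question: str) -> str:
    """Concatenate the extracted disease/chemical entities with AND."""
    keywords = [ent.text for ent in ner(question).ents]
    return " AND ".join(keywords) if keywords else question  # fall back to the raw question


def search_pubmed(query: str, max_date: str = "2024/01/01", retmax: int = 10) -> list[str]:
    """First stage: esearch restricted to the PubMed snapshot date of the batch."""
    params = {
        "db": "pubmed",
        "term": query,
        "retmax": retmax,          # increased to 30 for system 1 in batch 3
        "retmode": "json",
        "datetype": "pdat",
        "mindate": "1900/01/01",   # esearch expects mindate and maxdate as a pair
        "maxdate": max_date,
    }
    resp = requests.get(ESEARCH, params=params, timeout=30)
    return resp.json()["esearchresult"]["idlist"]


def rerank(question: str, abstracts: dict[str, str], top_k: int = 10) -> list[str]:
    """Second stage: rank documents by cosine similarity of sentence embeddings."""
    pmids = list(abstracts)
    doc_emb = encoder.encode([abstracts[p] for p in pmids], convert_to_tensor=True)
    q_emb = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]
    ranked = sorted(zip(pmids, scores.tolist()), key=lambda x: x[1], reverse=True)
    return [pmid for pmid, _ in ranked[:top_k]]


# Usage (abstract download not shown):
#   pmids = search_pubmed(build_query(question))
#   abstracts = fetch_abstracts(pmids)      # hypothetical efetch helper
#   top_docs = rerank(question, abstracts)
```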
3.1.3. Snippet Extraction

After identifying the top 10 documents, we break the documents into sentences and rank the sentences by similarity score, using the same sentence transformer and similarity calculation as the reranker. We then select the best sentence of each document as the predicted snippets.

Figure 3: Question Answering System Design

3.2. Question Answering

We use pre-trained large language models (LLMs) to generate answers for biomedical questions. In this project, instead of fine-tuning the LLMs, we use prompt engineering and response post-processing to build the system, as illustrated in Figure 3. The two key components of the prompt are the context and the few-shot examples.

We use the first 1000 words of the top 10 snippets for the question, where each snippet consists of one or more sentences of a PubMed abstract, as the context for question answering. The snippets are generated by the information retrieval system or are the golden snippets provided by the organizers. The reasoning behind using the first 1000 words of snippets is that the higher-ranked snippets contain the information most relevant to the question. The context is crucial for generating high-quality answers and reducing model hallucinations. We then construct few-shot examples from the training dataset. The few-shot examples help the LLM generate answers in the desired format. The prompt template used for yes/no questions is shown in Table 3. The maximum input token size of most LLMs we experimented with is larger than 4096, so a prompt consisting of few-shot examples, 1000 words of context (roughly equivalent to 1350 tokens), and the question body is within the LLM input token limit.

Table 3
Prompt Template for Question Answering, Yes/No Question

Context: Papilins are homologous, secreted extracellular matrix proteins which share a common order of protein domains.
Question: Is the protein Papilin secreted?
Ideal answer: Yes, papilin is a secreted protein
Exact answer: yes
###
Context: Most lncRNAs are under lower sequence constraints than protein-coding genes and lack conserved secondary structures, making it hard to predict them computationally.
Question: Are long non coding RNAs as conserved in sequence as protein coding genes?
Ideal answer: No. Most long non coding RNAs are under lower sequence constraints than protein-coding genes.
Exact answer: no
###
Context: {context}
Question: {question_body}
Ideal answer:

The prompt templates used for factoid, list, and summary questions are similar to the yes/no template, with some modifications to the ideal and exact answer formatting; these templates can be found in Appendix A. "###" is used as the separator between examples in the prompt and is also used as the "stop" string for the LLM completion. The prompt is intentionally designed to end with "Ideal answer:" to guide the LLM to generate the ideal answer. We expect the "exact answer" line to be generated after the ideal answer in the LLM response, as illustrated in Appendix B. For the other completion parameters (such as temperature and top_k), we use the default values defined by the TextSynth service [17], except that the max_tokens parameter, which controls the maximum number of tokens in the LLM output, is set to 200.
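The sketch below illustrates this prompt assembly and answer extraction for yes/no questions. The prompt-building and parsing logic follows the description above; the completion call is written against the TextSynth endpoint documented in [17], but the engine identifier, API key handling, and response field are assumptions that would need to be checked against the service actually used.

```python
"""Sketch of prompt assembly and answer extraction for yes/no questions (Section 3.2).
The TextSynth request follows the completion endpoint described in its documentation [17];
the engine name and the "text" response field are assumptions."""
import requests

# Few-shot examples abbreviated here; the full text is shown in Table 3.
FEW_SHOT = """Context: Papilins are homologous, secreted extracellular matrix proteins ...
Question: Is the protein Papilin secreted?
Ideal answer: Yes, papilin is a secreted protein
Exact answer: yes
###
Context: Most lncRNAs are under lower sequence constraints than protein-coding genes ...
Question: Are long non coding RNAs as conserved in sequence as protein coding genes?
Ideal answer: No. Most long non coding RNAs are under lower sequence constraints than protein-coding genes.
Exact answer: no
###
"""


def build_prompt(question: str, snippets: list[str], max_words: int = 1000) -> str:
    """Use the first 1000 words of the top-ranked snippets as context and end the
    prompt with 'Ideal answer:' so the model continues from that point."""
    context = " ".join(" ".join(snippets).split()[:max_words])
    return f"{FEW_SHOT}Context: {context}\nQuestion: {question}\nIdeal answer:"


def complete(prompt: str, engine: str = "mixtral_47B_instruct", api_key: str = "...") -> str:
    """Request a completion with '###' as the stop string and max_tokens set to 200."""
    resp = requests.post(
        f"https://api.textsynth.com/v1/engines/{engine}/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": prompt, "max_tokens": 200, "stop": "###"},
        timeout=60,
    )
    return resp.json()["text"]


def parse_answer(response: str) -> tuple[str, str]:
    """Split the completion into the ideal answer and the trailing exact answer."""
    ideal, _, exact = response.partition("Exact answer:")
    return ideal.strip(), exact.strip()
```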
We did not experiment with different roles or with system and user prompts, since not all models we experimented with are fine-tuned for chat; we relied only on the basic completion functionality of the LLMs to generate answers.

For list questions, we experiment with synonym grouping. The idea is to group synonyms across all the LLM responses to reduce repetition in the answers. This is similar to having the LLM perform a second stage of reasoning. Synonym grouping is accomplished by sending a prompt (shown in Table 4) that asks the LLM to group the synonyms. The prompt contains a list of entities, which aggregates all entities returned across multiple LLM responses to the list question with different prompt contexts.

Table 4
Prompt Template for Synonym Grouping

Group the phrases with the same meaning in the ENTITY list into separate lines as follows.
[ENTITY]: MOG-IgG; AQP4; MOG-IgG; serum neurofilament light chain; NfL; aquaporin-4 (AQP4)-immunoglobulin G (IgG)
[GROUP1]: aquaporin-4 (AQP4)-immunoglobulin G (IgG); AQP4; MOG-IgG
[GROUP2]: serum neurofilament light chain; NfL
###
[ENTITY]: {entity_list}
[GROUP1]:

3.2.1. Context Formation

We started with the first 1000 words of the top 10 snippets as the prompt context; the system we submitted to Synergy uses this basic version. We then experimented with different contexts for the QA system by changing the number and variety of snippets used. Our final context setup, used for batches 2 and 3 of Task 12b, is as follows:

1. For yes/no questions, we create three prompts with different contexts: the context of the first prompt is the first golden snippet, the context of the second prompt is the second golden snippet, and the context of the third prompt is the third golden snippet. We send all three prompts separately to the LLM and take the majority vote of the answers as the final answer.
2. For factoid and summary questions, we use one prompt with the first 1000 words of the golden snippets as the context.
3. For list questions, we use one prompt with the first 1000 words of the golden snippets as the context. In the synonym grouping setting, we compose five prompts, where the context of each prompt is a single golden snippet: the first prompt context is the first golden snippet, the second prompt context is the second golden snippet, and so on.

3.2.2. Response Post-processing

Since the prompt we send to the LLM contains examples in the desired answer format, the answers generated by the LLM are usually in the form of a long answer followed by an exact answer on the second line. We extract answers by parsing the response. For list and factoid questions, we give examples where entities are separated by semicolons, so we extract the resulting entities by splitting the exact answer string on semicolons. Examples of prompts and responses can be found in Appendix B.

When the answer does not follow the expected format, we resample the LLM output. Some checks we use to detect malformed answers are:

• There is no "exact answer" string in the response for yes/no, factoid, and list questions.
• The exact answer is not "yes" or "no" for yes/no questions.
• The exact answer for list or factoid questions separates entities by commas or newlines instead of semicolons.
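A compact sketch of these checks and the resampling loop is shown below. It reuses the hypothetical build_prompt, complete, and parse_answer helpers sketched in Section 3.2; the retry limit and the question-type strings are assumptions made for illustration.

```python
"""Sketch of the format checks and resampling loop from Section 3.2.2, reusing the
hypothetical build_prompt, complete, and parse_answer helpers sketched earlier.
The retry limit and question-type labels are illustrative choices."""


def is_malformed(response: str, qtype: str) -> bool:
    """Return True when the completion should be resampled."""
    if "exact answer" not in response.lower():
        return qtype in {"yesno", "factoid", "list"}
    _, exact = parse_answer(response)
    if qtype == "yesno":
        return exact.lower() not in {"yes", "no"}
    if qtype in {"factoid", "list"}:
        # entities should be separated by semicolons, not commas or newlines
        return ";" not in exact and ("," in exact or "\n" in exact)
    return False


def answer_question(question: str, snippets: list[str], qtype: str, max_tries: int = 3):
    """Resample the LLM output until it matches the expected answer format."""
    prompt = build_prompt(question, snippets)
    response = complete(prompt)
    for _ in range(max_tries - 1):
        if not is_malformed(response, qtype):
            break
        response = complete(prompt)  # resample with the same prompt
    ideal, exact = parse_answer(response)
    if qtype in {"factoid", "list"}:
        # split the exact answer on semicolons to obtain the entity list
        return ideal, [e.strip() for e in exact.split(";") if e.strip()]
    return ideal, exact
```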
4. Results

In this section, we present and discuss the results of our systems for the BioASQ Task 12b and Synergy tasks. The results are based on the official evaluation scores provided by the organizers on the BioASQ 2024 competition leaderboard.

4.1. Synergy Task

We submitted five systems for the Synergy task to measure the performance of different pre-trained large language models and strategies. The system configurations are outlined in Table 5.

Table 5
Synergy Task System Configuration

System | Leaderboard Name | IR algorithm | QA Model | QA has context
system1 | Gatech competition | spaCy biomedical NER model | Mistral 7B | No
system2 | GTBioASQsys2 | LLM (Mistral 7B) + few-shot prompt | Mistral 7B | Yes
system3 | GTBioASQsys3 | LLM (Mixtral 47B) + few-shot prompt | Mixtral 47B Instruct | Yes
system4 | GTBioASQsys4 | LLM (Llama2 70B) + few-shot prompt | Llama2 70B | Yes
system5 | GTBioASQsys5 | LLM (GPT-J 6B) + few-shot prompt | GPT-J 6B | Yes

For the information retrieval (IR) part, system 1 uses en_ner_bc5cdr_md to extract the keywords from the questions. The remaining systems use large generative language models to extract the keywords; the models used for systems 2, 3, 4, and 5 are Mistral 7B, Mixtral 47B (i.e., the Mixtral 8x7B model) [9], Llama2 [8], and GPT-J [18], respectively. For the question-answering part, the prompt of system 1 contains no context, whereas the remaining systems use snippets as context. The LLM models used for systems 1, 2, 3, 4, and 5 are Mistral 7B, Mistral 7B, Mixtral 47B, Llama2, and GPT-J, respectively.

Table 6 shows the information retrieval results for Synergy task round 4. System 3 has the best performance, with a mean average precision (MAP) score of 0.0434 for document retrieval and 0.0317 for snippet retrieval. Systems 2, 4, and 5 have similar document MAP scores, in the range between 0.02 and 0.03, whereas system 1 performs worst with a MAP score of 0.0003. The systems used for the Synergy task only perform the basic first-level retrieval by fetching 10 records from PubMed using a query concatenated from keywords. We can see that Mixtral 47B outperforms the other systems on the question keyword extraction task. The spaCy language model en_ner_bc5cdr_md performs the worst: it is often unable to detect any keywords in the question body, since it is limited to detecting only disease and chemical entities in the sentence.

Table 6
Synergy Round 4 Information Retrieval Results

System | Document Macro-F1 | Document MAP | Snippet Macro-F1 | Snippet MAP
top competitor | 0.1459 | 0.1308 | 0.0982 | 0.1003
system1 | 0.001 | 0.0003 | 0.0014 | 0.0003
system2 | 0.0283 | 0.027 | 0.0146 | 0.0215
system3 | 0.0578 | 0.0434 | 0.034 | 0.0317
system4 | 0.0355 | 0.0258 | 0.0225 | 0.0249
system5 | 0.0353 | 0.0267 | 0.0273 | 0.0237

Table 7
Synergy Round 4 Question Answering Results

System | Yes/no F1 | Factoid MRR | List F1 | Ideal Answer Manual Score (total out of 20)
top competitor | 1.0 | 0.43 | 0.27 | 17.41
system1 | 0.33 | 0.1818 | 0.0806 | 13.56
system2 | 0.4 | 0.3182 | 0.2009 | 14.91
system3 | 0.625 | 0.2727 | 0.2256 | 16.4
system4 | 0.4857 | 0.2727 | 0.1919 | 11.55
system5 | 0.4857 | 0.1364 | 0.1467 | 9.05

Table 7 shows the question-answering results for Synergy task round 4. Our best-performing system achieved a 0.62 F1 score for yes/no questions, 0.3182 mean reciprocal rank (MRR) for factoid questions, and a 0.22 F1 score for list questions. The Mixtral 47B model gives the best performance on all question types except factoid questions, where system 2 with the Mistral 7B model performs best. Even though the Llama2 model is much larger, it does not perform as well as the Mixtral/Mistral models. By comparing the results of system 1 and system 2, we can see that adding context to the prompt greatly improves QA accuracy for all question types for the same LLM engine.
The ideal answer is evaluated manually according to the readability, recall, precision, and repetition of the answers. Mixtral 47B performs best at generating long answers. The Llama2 model performs worst at generating long answers, even though it is on par with the other models in exact answer generation. This seems to indicate that the Llama2 model is not good at generating long answers, which could be due to differences in model architecture. It is also possible that the Llama2 model was trained to give vague answers to biomedical questions so that it does not mislead users on health issues.

4.2. BioASQ Task 12B

For Task 12B, we added the direct query generation method to our experiments. We updated the reranker component to select the top 10 documents from the top 30 documents retrieved in the first stage via the PubMed query for system 1 in batch 3. In addition, we enhanced the system with response resampling and added a fallback that uses a query built from the original question if LLM keyword extraction fails to generate any keyword or the query generated by GPT-4 [7] returns no results.

4.2.1. Task 12B Phase A

Of the three systems we submitted to Phase A of Task 12B, system 1 uses the direct query generation method, with GPT-3.5 for batch 1 and GPT-4 for batches 2 and 3. Systems 2 and 3 continue to use the keyword extraction method with Mistral 7B and Mixtral 47B, as before. The system configurations are outlined in Table 8.

Table 8
12B System Configurations

System | Leaderboard Name | Phase A IR method | Phase A+ QA context | Phase B QA strategy
system1 | Gatech competition | Query generation with GPT-3.5 or GPT-4 | IR snippet | Golden snippet (add synonym grouping for list questions in batches 2 and 3)
system2 | GTBioASQsys2 | Keyword extraction with Mistral 7B | No snippet | Golden snippet
system3 | GTBioASQsys3 | Keyword extraction with Mixtral 47B | IR snippet | -

In Table 9, all three systems in batch 2 have similar document retrieval performance, with MAP scores around 0.081, with system 3 performing best. For snippet retrieval, system 3 has the best performance with a MAP score of 0.0271, followed by system 2 and system 1. System 1's performance also improved from 0.0497 in batch 1 to 0.081 in batch 2 after switching from GPT-3.5 to GPT-4 for query generation.

In batch 3, system 1 had a significant improvement in MAP, to 0.1385, thanks to increasing the number of articles retrieved from PubMed in the initial retrieval stage from 10 to 30. This allows more articles to be processed in the reranking stage and results in higher overall recall. In past competitions, solutions that use BM25 models for retrieval fetch hundreds of documents in the initial retrieval stage [19], and these systems also tend to have the best scores on the IR task. We hypothesize that our systems, which use an LLM for the retrieval stage, would perform even better if the number of articles retrieved initially were increased further, to 100. However, due to the time required to fetch the articles from the PubMed API, extract the snippets, and score the articles for similarity with the query, increasing the number of retrieved documents results in a long wait time.
Table 9
12B Phase A Information Retrieval Results

Batch | System | Document MAP | Snippet MAP
1 | top competitor | 0.2067 | 0.1149
1 | system1 | 0.0497 | 0.0150
1 | system2 | 0.0599 | 0.0050
1 | system3 | 0.0717 | 0.0204
2 | top competitor | 0.2293 | 0.1539
2 | system1 | 0.081 | 0.0151
2 | system2 | 0.0826 | 0.0261
2 | system3 | 0.0836 | 0.0271
3 | top competitor | 0.2549 | 0.2224
3 | system1 | 0.1385 | 0.0505
3 | system2 | 0.0883 | 0.0312
3 | system3 | 0.0929 | 0.0271

4.2.2. Task 12B Phase A+ and Phase B

For the question answering (QA) part, we enhanced our system by adding resampling when the exact answer does not satisfy the format requirements, for example, when neither "yes" nor "no" appears in the answer to a yes/no question. We also experimented with different contexts for the QA system by changing the number and variety of snippets used.

We submitted three systems to Phase A+. Phase A+ system 1 uses the snippets generated by system 1 in Phase A as the context for the QA prompt. Phase A+ system 2 does not use any snippets as context. Phase A+ system 3 uses the snippets generated by system 3 in Phase A as the context. In Phase B, we use the golden snippets provided by the organizers as the context of the QA prompt for Phase B system 1 and system 2; the difference is that Phase B system 1 performs synonym grouping for list questions, whereas Phase B system 2 does not.

Table 10 shows the results of all five systems in Phase A+ and Phase B. Taking batch 2 as an example, the system without context (Phase A+ system 2) achieved only a 0.69 F1 score for yes/no questions. Adding context improves the F1 score to 0.80, and adding golden snippets as the context further improves it to 0.96. For factoid questions, adding non-golden snippets as context does not improve the MRR score, but adding golden snippets as context improves the MRR score from 0.21 to 0.36. For list questions, adding non-golden snippets as context improves the F1 score slightly, and adding golden snippets as context further improves the F1 score from 0.21 to 0.50. The batch 1 and 3 results follow the same pattern.

Table 10
12B Question Answering Results

Batch | System | Yes/no F1 | Factoid MRR | List F1
1 | Phase A+ top competitor | 0.9167 | 0.3611 | 0.4808
1 | Phase A+ system1 | 0.8397 | 0.1429 | 0.3661
1 | Phase A+ system2 | 0.7987 | 0.0952 | 0.2350
1 | Phase A+ system3 | 0.8397 | 0.1429 | 0.1866
1 | Phase B top competitor | 0.9589 | 0.4286 | 0.5843
1 | Phase B system1 | 0.8792 | 0.1905 | 0.3975
2 | Phase A+ top competitor | 0.9601 | 0.3947 | 0.4088
2 | Phase A+ system1 | 0.8051 | 0.2105 | 0.2133
2 | Phase A+ system2 | 0.6905 | 0.2105 | 0.1268
2 | Phase A+ system3 | 0.8003 | 0.2105 | 0.1364
2 | Phase B top competitor | 0.9601 | 0.6842 | 0.5708
2 | Phase B system1 | 0.9601 | 0.3684 | 0.2936
2 | Phase B system2 | 0.8802 | 0.358 | 0.4964
3 | Phase A+ top competitor | 0.9143 | 0.3301 | 0.3708
3 | Phase A+ system1 | 0.7822 | 0.2308 | 0.1774
3 | Phase A+ system2 | 0.6643 | 0.0769 | 0.2098
3 | Phase A+ system3 | 0.7483 | 0.1538 | 0.1860
3 | Phase B top competitor | 1.0 | 0.5 | 0.6484
3 | Phase B system1 | 0.8748 | 0.3077 | 0.3399
3 | Phase B system2 | 0.9577 | 0.3846 | 0.4763

5. Discussion

The Task 12B results show that our systems with golden snippets as the context achieve an F1 score of 0.87-0.96 for yes/no questions. After batch 1, we improved the post-processing steps for factoid and list questions by removing duplicate answers and detecting malformed answers; as a result, the MRR score of our final system for factoid questions is in the range of 0.3-0.4, and the F1 score for list questions is 0.45-0.5. By comparing the list F1 scores of Phase B system 1 and system 2 in batches 2 and 3, we can see that synonym grouping performs worse than not using synonym grouping.
To understand the reasons, we looked at some of the synonym grouping responses from the LLM and found that it often groups entities that belong in different categories. Table 11 shows an example prompt and response pair.

Table 11
Example Response from LLM to Synonym Grouping Prompt

Grouping Prompt:
Group the phrases with the same meaning in the ENTITY list into separate lines as follows. (examples)
[ENTITY]: fibromyalgia; chronic fatigue syndrome; autosomal dominant polycystic kidney disease; Marfan syndrome; Cutis laxa syndromes; depression; osteogenesis Imperfecta Type 1; Loey-Dietz syndrome; hypermobility spectrum disorders
[GROUP1]:

Grouping Response:
fibromyalgia; chronic fatigue syndrome
[GROUP2]: autosomal dominant polycystic kidney disease; Marfan syndrome; osteogenesis Imperfecta Type 1; Loey-Dietz syndrome
[GROUP3]: Cutis laxa syndromes
[GROUP4]: depression; hypermobility spectrum disorders

We can see that "fibromyalgia" and "chronic fatigue syndrome" are grouped as synonyms, and "depression" and "hypermobility spectrum disorders" are grouped as synonyms, whereas they should all be separate entities. As a result, synonym grouping does not help list question performance. This also demonstrates that adding a second stage of LLM reasoning does not always give better results for complex problems.

Our key takeaways from the experiments are:

1. For the IR part, Mixtral 47B is the best-performing model for question keyword extraction among the models we tested. Retrieving more documents in the initial retrieval stage can improve the performance of the system.
2. For the QA part, adding context to the prompt, especially using "correct" snippets as the context, can greatly improve QA accuracy for all question types.
3. The improvements in the QA scores in batches 2 and 3 of Task 12B demonstrate that resampling the LLM response is an effective technique for improving accuracy. Simple response post-processing steps that validate the output format can also improve performance.
4. By comparing the results of Llama2 and the other models in the Synergy task, we found that the Llama2 model is not good at generating long answers, even though it is on par with the other models in exact answer generation.
5. Two-stage LLM reasoning does not always give better results for complex problems, as shown by the synonym grouping experiment.

6. Future Work

Here are some ideas for future work to improve the performance of our systems. For the IR part, we currently fetch only a small number of documents from PubMed and use embeddings to rank them. We found that increasing the number of documents fetched in the initial retrieval stage improves recall and the overall MAP score, but leads to long processing times; calculating embeddings on the fly is especially time-consuming. In the future, we want to embed all PubMed documents in advance and store the embeddings in a vector database. This way, we can fetch more documents in the first-stage retrieval for second-stage reranking, as we would be able to look up the embeddings for a specific document quickly. We can also use similarity search on the vector database to directly fetch relevant documents for a question. In addition, when calculating the similarity between the question and a document, we currently only use the first part of the document, which fits the embedding model's input token size. We want to investigate whether splitting documents into multiple parts and calculating the similarity for each part can improve the performance.
We can also explore the performance of using different sentence embedding models. For the QA part, our current system is based on the few-shot examples to guide the LLM to generate answers. We only used a few training examples in the BioASQ dataset and have not utilized the potential of the BioASQ dataset. The next step would be to fine-tune the pretrained LLM model (specifically Mixtral47B) on the BioASQ dataset. The experience of crafting examples for prompt engineering can help us prepare training data for fine-tuning LLM. We will consider using Low Rank Adaptation (LoRA) [20] as a cost-effective method for finetuning a model with a large number of parameters. 7. Conclusions We implemented information retrieval and question-answering systems for the BioASQ Task 12b and Synergy tasks. The information retrieval system uses pretrained LLM and prompt engineering to search documents and uses sentence embeddings to rank documents. The question answering system uses in-context few-shot examples to guide the LLM to generate answers while passing article snippets as context. Our final system incorporates several useful techniques such as resampling and response post-processing for LLM interaction. We experimented with various state-of-the-art LLM models, compared their performance and found that Mixtral 47B is overall the best-performing model. Our best-performing system achieved 0.14 MAP score on document retrieval, 0.05 MAP score on snippet retrieval, 0.96 F1 score for yes/no questions, 0.38 MRR score for factoid questions, and 0.50 F1 score for list questions in Task 12b. We hope this work can provide insights for future research in building biomedical question answering systems using large language models. Acknowledgements Thank you to the Data Science @ Georgia Tech (DS@GT CLEF) team and Anthony Miyaguchi for their support. We acknowledge the use of Grammarly [21] to proofread this paper. References [1] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024. [2] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024. [3] Pubmed overview (2023). URL: https://pubmed.ncbi.nlm.nih.gov/about/. [4] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170. [5] Q. Chen, H. Sun, H. Liu, Y. Jiang, T. Ran, X. Jin, X. Xiao, Z. Lin, H. Chen, Z. Niu, An extensive benchmark study on biomedical text generation and mining with ChatGPT, Bioinformatics 39 (2023) btad557. URL: https://doi.org/10.1093/bioinformatics/btad557. doi:10.1093/bioinformatics/ btad557. [6] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, D. Schmidt, A prompt pattern catalog to enhance prompt engineering with chatgpt, 2023. doi:10.48550/ arXiv.2302.11382. 
[7] OpenAI, Gpt-4 a large-scale transformer-based language model, OpenAI (2023). URL: https: //chat.openai.com. [8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, Llama open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023). URL: https://arxiv.org/abs/2302.13971. [9] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, Mixtral of experts, arXiv preprint arXiv:2401.04088 (2024). URL: https://arxiv.org/abs/2401.04088. [10] A. Nentidis, G. Katsimpras, A. Krithara, S. L. López, E. Farré-Maduell, L. Gasco, M. Krallinger, G. Paliouras, Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering (2023). URL: https://arxiv.org/abs/2307.05131. [11] E. Landhuis, Scientific literature: Information overload, Nature 535 (2016) 457–458. URL: https://www.nature.com/articles/nj7612-457a. doi:https://doi.org/10.1038/nj7612-457a, published online 20 July 2016. [12] S. Ateia, U. Kruschwitz, Is chatgpt a biomedical expert? exploring the zero-shot performance of current gpt models in biomedical tasks, CEUR Workshop Proceedings 3497 (2023). URL: https://ceur-ws.org/Vol-3497/paper-006.pdf. [13] H. Kim, H. Hwang, C. Lee, W. Y. Minju Seo, J. Kang, Exploring approaches to answer biomedical questions: From pre-processing to gpt-4 (2023). URL: https://ceur-ws.org/Vol-3497/paper-011.pdf. [14] E. Sayers, A general introduction to the e-utilities, National Center for Biotechnology Information (US) (2009). URL: https://www.ncbi.nlm.nih.gov/books/NBK25497/. [15] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing (2019) 319–327. URL: https://www.aclweb.org/anthology/W19-5034. doi:10.18653/v1/W19-5034. arXiv:arXiv:1902.07669. [16] H. Face, sentence-transformers/all-minilm-l6-v2, Hugging Face Community week (2021). URL: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2. [17] Textsynth documentation (2024). URL: https://textsynth.com/documentation.html. [18] B. Wang, A. Komatsuzaki, GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model (2021). [19] G. H. Maël Lesavourey, Bioasq 11b: Integrating domain specific vocabulary bert-based model for biomedical document reranking, Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023) (2023). URL: https://ceur-ws.org/Vol-3497/. [20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models (2022). URL: https://openreview.net/forum?id=nZeVKeeFYf9. [21] Grammarly, Grammarly handbook (2024). URL: https://www.grammarly.com/handbook/. A. Prompt Templates Table 12 Prompt Template for List Questions Context: The FGFR3 P250R mutation was the single largest contributor (24%) to the genetic group; Syndromic craniosynostosis due to complex chromosome 5 rearrangement and MSX2 gene triplication Question: Which human genes are more commonly related to craniosynostosis? Ideal answer: The genes that are most commonly linked to craniosynostoses are the members of the Fibroblast Growth Factor Receptor family FGFR3 and to a lesser extent FGFR1 and FGFR2. Some variants of the disease have been associated with the triplication of the MSX2 gene and mutations in NELL-1. NELL-1 is being regulated bu RUNX2, which has also been associated to cases of craniosynostosis. 
Other genes reported to have a role in the development of the disease are RECQL4, TWIST, SOX6 and GNAS. Exact answer: FGFR3;FGFR2;FGFR1;MSX2;NELL1;RUNX2;RECQL4;TWIST;SOX6;GNAS ### Context: The current article presents a concise review of network theory and its application to the characteri- zation of AED use in children with refractory epilepsy; Recent results suggest that LCM has a dual mode of action underlying its anticonvulsant and analgesic activity. Question: What are the main indications of lacosamide? Ideal answer: Lacosamide is an anti-epileptic drug, licensed for refractory partial-onset seizures. In addition to this, it has demonstrated analgesic activity in various animal models. Apart from this, LCM has demonstrated potent effects in animal models for a variety of CNS disorders like schizophrenia and stress induced anxiety. Exact answer: refractory epilepsy;analgesic;CNS disorders ### Context: {context} Question: {body} Ideal answer: Table 13 Prompt Template for Factoid Questions Context: Ewing sarcoma is the second most common bone malignancy in children and young adults. It is driven by oncogenic fusion proteins (i.e. EWS/FLI1) acting as aberrant transcription factors that upregulate and downregulate target genes, leading to cellular transformation; Ewing sarcoma/primitive neuroectodermal tumors (EWS/PNET) are characterized by specific chromosomal translocations most often generating a chimeric EWS/FLI-1 gene Question: Which fusion protein is involved in the development of Ewing sarcoma? Ideal answer: Ewing sarcoma is the second most common bone malignancy in children and young adults. In almost 95% of the cases, it is driven by oncogenic fusion protein EWS/FLI1, which acts as an aberrant transcription factor, that upregulates or downregulates target genes, leading to cellular transformation. Exact answer: EWS;FLI1 ### Context: Acrokeratosis paraneoplastica of Bazex is a rare but important paraneoplastic dermatosis, usually manifesting as psoriasiform rashes over the acral sites Bazex syndrome (acrokeratosis paraneoplastica): persistence of cutaneous lesions after successful treatment of an associated oropharyngeal neoplasm. Question: Name synonym of Acrokeratosis paraneoplastica. Ideal answer: Acrokeratosis paraneoplastic (Bazex syndrome) is a rare, but distinctive paraneoplastic dermato- sis characterized by erythematosquamous lesions located at the acral sites and is most commonly associated with carcinomas of the upper aerodigestive tract. Exact answer: Bazex syndrome ### Context: {context} Question: {body} Ideal answer: Table 14 Prompt Template for Summary Question Context: Hirschsprung disease (HSCR) is a multifactorial, non-mendelian disorder in which rare high- penetrance coding sequence mutations in the receptor tyrosine kinase RET contribute to risk in combination with mutations at other genes. Question: Is Hirschsprung disease a mendelian or a multifactorial disorder? Answer: Coding sequence mutations in RET, GDNF, EDNRB, EDN3, and SOX10 are involved in the devel- opment of Hirschsprung disease. The majority of these genes was shown to be related to Mendelian syn- dromic forms of Hirschsprung’s disease, whereas the non-Mendelian inheritance of sporadic non-syndromic Hirschsprung disease proved to be complex; involvement of multiple loci was demonstrated in a multiplicative model. ### Context: Significant decreases and progressively higher reductions of systolic and diastolic blood pressure were observed after RSD. 
The RSD presents itself as an effective and safe approach to resistant hypertension. Question: What are the outcomes of Renal sympathetic denervation? Answer: Renal sympathetic denervation reduces heart rate, systolic and diastolic blood pressure. ### Context: {context} Question: {body} Answer B. Example LLM Prompt and Response Table 15 Yes/No Question - Example Prompt and Response Prompt (examples) Context: No significant differences were observed, but FOLFOXIRI-bevacizumab achieved numerically superior survival outcomes versus FOLFOX-panitumumab. These observations suggest that although doublet chemotherapy plus anti-EGFRs remains the preferred treatment in patients with left-sided RAS/BRAF wild-type mCRC, FOLFOXIRI-bevacizumab is a valuable option able to provide similar, if not better, outcomes at the price of a moderate increase in toxicity and may be adopted based on patients’ preference and potential impact on quality of life Question: Are FOLFIRINOV plus Bevacizuma the preferred therapy for left-sided RAS wild type advanced colorectal cancer with aggressive clinical presentation? Ideal answer: Response Yes, based on the observed survival outcomes, doublet chemotherapy plus anti-EGFRs remains the preferred treatment in patients with left-sided RAS/BRAF wild-type mCRC, but FOLFIRINOV plus Bevacizumab can be considered based on patients’ preference and potential impact on quality of life. Exact answer: Yes Table 16 List Question - Prompt and Sample response Prompt (examples) Context: Recent studies have indicated associations between aberrant behavior of amyloid-beta and tau proteins and various neurological diseases, such as Alzheimer’s disease, Parkinson’s disease, and amyotrophic lateral sclerosis, as well as retinal neurodegenerative diseases like Glaucoma and age-related macular degener- ation. Additionally, these proteins have been linked to cardiovascular disease, cancer, traumatic brain injury, and diabetes. Question: Amyloid-𝛽 is associated with what diseases? Ideal answer: Response Amyloid-𝛽 is associated with Alzheimer’s disease, Parkinson’s disease, amyotrophic lateral sclerosis, Glaucoma, age-related macular degeneration, cardiovascular disease, cancer, traumatic brain injury, and diabetes. Exact answer: Alzheimer’s disease; Parkinson’s disease; amyotrophic lateral sclerosis; Glaucoma; age-related macular degeneration; cardiovascular disease; cancer; traumatic brain injury; diabetes