
BIT.UA at BioASQ 11B: Two-Stage IR with Synthetic Training and Zero-Shot Answer Generation

Tiago Almeida

tiagomeloalmeida@ua.pt

Richard A. A. Jonker

richard.jonker@ua.pt

Roshan Poudel

proshan@ua.pt

Jorge M. Silva

jorge.miguel.ferreira.silva@ua.pt

Sérgio Matos

aleixomatos@ua.pt

IEETA/DETI, LASI, University of Aveiro, Portugal

This paper presents the efforts of the Biomedical Informatics and Technologies (BIT) group at the University of Aveiro in the eleventh edition of the BioASQ challenge. We addressed Task B in its two phases: document retrieval (phase A) and question answering (phase B). In phase A, we used a sparse retrieval method, implemented with Anserini, for initial document retrieval, followed by a re-ranking step using transformer models, including monoT5 and PubMedBERT. In phase B, we applied large language models (LLMs), such as Alpaca-LoRA, OA-Pythia, and OA-LLaMA, to generate answers to questions based on a relevant article. We also explored a variety of prompts and question types, as well as different generation strategies, to optimize our system’s performance. In phase A, our systems achieved competitive results, scoring at or close to the top in all batches and achieving the best F1 results in all batches. In phase B, our systems underperformed according to the automatic measures. Code to reproduce our submissions is available at https://github.com/ieeta-pt/BioASQ_11B.

Keywords: Information Retrieval, Dense Retrieval, Language Model, Answer Generation

1. Introduction

The biomedical literature has been growing exponentially, predominantly driven by the rise in open-access and peer-reviewed publications. This rapid expansion results in an information overload, posing a significant challenge to researchers, physicians, and other healthcare practitioners [ 1 ]. As delineated by Klerings et al. [ 1 ], the primary concern stems not from the abundance of information but from the scarcity of sophisticated information retrieval systems proficient in managing this growing body of literature. To mitigate this, the BioASQ challenge is a yearly competition that stimulates the creation of intelligent retrieval systems. In its eleventh year, the BioASQ challenge [ 2, 3 ] comprises several tasks targeting unique facets of information retrieval and text mining within the biomedical domain.

Task B and the Synergy task emphasise information retrieval and question answering. Task B is divided into phases A and B. Phase A involves identifying relevant documents or snippets that answer a biomedical question, while phase B addresses the extraction and generation of responses. These tasks collectively aim at advancing systems that provide evidence or answers to open-ended biomedical queries. In contrast, the Synergy task seeks to resolve open-ended questions about COVID-19 by leveraging IR and QA systems.

This paper describes our participation in Task B of the BioASQ challenge, covering phase A and the ideal answers of phase B. During phase A, we used the traditional BM25 [ 4 ] for first-stage document retrieval, followed by document re-ranking with a variety of transformer models, including monoT5 [ 5 ] and PubMedBERT [ 6 ]. These models were fine-tuned on prior years’ data, and synthetic data generation was employed to mitigate the constraints of a small dataset size. During phase B, we adopted a naive unsupervised approach in which language models were prompted to generate an answer to a question, given an article as context. The approach involved exploring various models and prompts along with differing context selections. Figure 1 illustrates the end-to-end pipeline of the information retrieval and answering system.

Following this introduction, Section 2 reviews the related work. Section 3 presents the methodology, describing the datasets and corpora used and detailing the methods employed. Section 4 shows our results and Section 5 discusses them. The paper concludes in Section 6, summarising the key findings of our participation, with a brief discussion of future work in Section 7.

2. Related Work

The BioASQ challenge has consistently catalyzed significant advancements in biomedical information retrieval and question-answering. Task B, in particular, encapsulates the essence of these complex processes, focusing on two fundamental fields: Information Retrieval (IR) and Question Answering (QA).

Fundamentally, IR (phase A) aims to identify and retrieve relevant documents or snippets that align with a posed biomedical question, thereby addressing the issue of locating pertinent information within the vast corpus of biomedical literature [ 2 ]. QA (phase B), on the other hand, is concerned with extracting and generating comprehensive answers from the retrieved information. This intricate process requires understanding the question at hand and determining the most suitable answer by leveraging the context provided by the retrieved documents.

In the latest competition, the state-of-the-art performances were achieved by systems that used a two-step process: an initial sparse retrieval system followed by a Transformer-based re-ranking model [ 7 ]. This approach was not unique to a single submission but was rather a common thread among various entries. Our previous work, Almeida et al. [ 8 ], also employed a similar pipeline that used BM25 as the first stage and, in the second stage, powerful models such as PubMedBERT [ 6 ] and UPWM [ 9 ]. These models have shown remarkable proficiency in interpreting intricate biomedical queries and matching them to relevant articles.

2.1. Information Retrieval

Information Retrieval (IR) involves identifying relevant documents that match a specific query. IR can be broadly categorized into sparse retrieval and dense retrieval. Sparse retrieval, usually associated with more traditional approaches, involves converting text into an inverted index to enable fast searching. An inverted index stores a mapping of terms to documents. Sparse retrieval has the advantage of being fast and explainable. Simpler sparse retrieval approaches include Bag-of-Words and term frequency-inverse document frequency (tf-idf). There are also sparse retrieval techniques enhanced by transformer-based models, such as DeepCT [ 10 ] and HDCT [ 11 ], which produce contextualized term weights that can be stored in traditional inverted indexes. Nevertheless, one of the most relevant and well-known algorithms used in sparse retrieval is BM25 [ 4 ].

$$\mathrm{BM25}(q, d) = \sum_{t \in q} \left( \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)} \cdot \ln\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right) \right).$$

Here, $\mathrm{tf}(t, d)$ represents the term frequency of term $t$ in document $d$, $|d|$ represents the length of the document, $\mathrm{avgdl}$ is the average length of a document in the collection, $N$ is the number of documents in the collection and $\mathrm{df}(t)$ is the number of documents containing term $t$. $k_1$ and $b$ are hyperparameters that can be tuned.
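As an illustration of the scoring function above, the following minimal Python sketch scores tokenised documents against a tokenised query. The toy corpus, the function name `bm25_score` and the hyperparameter values passed in the usage lines are assumptions made for the example; production toolkits such as Anserini use their own, more elaborate implementations.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1, b):
    """Score one tokenised document against a tokenised query with BM25."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # a term absent from the document contributes nothing
        df = doc_freq[term]
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - b + b * doc_len / avgdl
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


# Toy, hypothetical corpus (already tokenised); not data from the challenge.
corpus = [
    ["aspirin", "reduces", "fever"],
    ["aspirin", "aspirin", "dosage", "guidelines", "for", "adults"],
    ["gene", "expression", "in", "yeast"],
    ["protein", "folding", "dynamics"],
    ["clinical", "trial", "design"],
]
# Document frequency: number of documents in which each term occurs.
doc_freq = Counter(term for doc in corpus for term in set(doc))
avgdl = sum(len(doc) for doc in corpus) / len(corpus)

query = ["aspirin", "fever"]
scores = [
    bm25_score(query, doc, doc_freq, len(corpus), avgdl, k1=0.5, b=0.3)
    for doc in corpus
]
```

With this form of the inverse document frequency, a term that occurs in more than half of the documents receives a negative weight, which is why some implementations add 1 inside the logarithm.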

On the other hand, a more recent approach called dense retrieval has emerged, using transformer models to convert both documents and queries into the same dimensional space [ 12 ]. In this approach, the query is transformed into a vector representation by the dense retrieval model. The search process involves comparing the similarity of the query vector against all the document vectors that have been previously encoded. Prominent approaches in this domain include DPR [ 13 ] and ANCE [ 14 ], which employ transformer-based models to learn a joint dimensional space for projecting queries and documents in a meaningful way. This enables queries to be closer in dimensional space to their relevant documents. To facilitate efficient execution of this type of search, libraries like Facebook’s FAISS [ 15 ] offer a comprehensive framework designed specifically for this purpose.
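The search step of dense retrieval can be sketched in a few lines of Python. The three-dimensional vectors below are hypothetical stand-ins for the embeddings that an encoder such as DPR would produce, and the exhaustive comparison is only practical for tiny collections; it is the part that a library like FAISS would replace with an efficient index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dense_search(query_vec, doc_vecs, top_k=2):
    """Rank pre-encoded documents by similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, hand-written embeddings (a real encoder would produce these).
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.0, 1.0, 0.0],
    "d3": [0.7, 0.7, 0.1],
}
query_vec = [1.0, 0.2, 0.0]
results = dense_search(query_vec, doc_vecs, top_k=2)
```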

Both the dense retrieval and sparse retrieval techniques can be broadly classified as representation-based approaches. In this approach, the document and query are encoded separately, and the search is performed based on either similarity measures (dense retrieval) or cumulative scores (sparse retrieval). In contrast, interaction-based approaches jointly score the query and document, allowing for the extraction of more intricate matching patterns and potentially improving retrieval results. However, due to the need to score the query against every document in the collection, interaction-based approaches are not practical for searching the entire document corpus. Therefore, representation-based approaches are commonly adopted as first-stage retrieval techniques to reduce the search space. Subsequently, more powerful interaction-based techniques can be employed to further refine the ranking order, a process known as re-ranking in the literature. These models are typically trained using pointwise and pairwise techniques [ 16 ]. Pointwise learning involves assigning a score to each document, and the ranking is then performed by sorting these scores. On the other hand, pairwise learning involves comparing pairs of documents and enforcing a margin between positive and negative document pairs, leading to a more discriminative learning process.
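The difference between the two training regimes can be made concrete with two common loss formulations, sketched below in Python. These are generic textbook choices (binary cross-entropy for pointwise, a hinge loss with margin for pairwise), given here as an assumption for illustration rather than as the exact losses used by any particular system.

```python
import math


def pointwise_loss(score, label):
    """Binary cross-entropy on a single (query, document) score.

    `score` is the raw model output (a logit); `label` is 1 for a relevant
    document and 0 for an irrelevant one.
    """
    prob = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def pairwise_margin_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that pushes a relevant document above an irrelevant one.

    The loss is zero once the positive score exceeds the negative score
    by at least `margin`.
    """
    return max(0.0, margin - (pos_score - neg_score))


# A confident, correct pointwise prediction yields a smaller loss
# than a confident, wrong one.
good = pointwise_loss(3.0, 1)
bad = pointwise_loss(-3.0, 1)

# The pair (2.0, 0.5) already satisfies the margin; (0.5, 0.4) does not.
satisfied = pairwise_margin_loss(2.0, 0.5)
violated = pairwise_margin_loss(0.5, 0.4)
```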

2.2. Question Answering

Question Answering (QA) aims to provide accurate and relevant answers to various questions. QA tasks can be generally divided into two main categories:

• Extractive QA involves identifying and extracting an answer from the given context.
• Generative QA requires the model to generate an answer freely, sometimes requiring a context.

Generative QA can further be divided into open and closed generative QA. In open generative QA, the text is generated using a context provided. This is not to be confused with open-domain QA. Closed generative QA has no context, and the model entirely generates the answer.

More recent generative QA approaches leverage large language models (LLMs) for zero-shot answer generation. In this setup, the model is provided with a query containing the context and asked to generate an answer. This approach is relatively new in the literature. GPT-3 [ 17 ] is a powerful autoregressive language model that uses deep learning to produce human-like text. It has 175 billion parameters and has been applied successfully in zero-shot tasks that require a deep understanding of context, making it a suitable choice for generative QA tasks.

Other recent LLMs have surfaced, such as LLaMA [ 18 ], Alpaca [ 19 ] and Pythia [ 20 ]. LLaMA is a foundation LLM that is based on various transformer-based architectures, namely GPT-3 [ 17 ], PaLM [ 21 ], and GPTNeo [ 22 ]. Alpaca is an LLM based on LLaMA that was fine-tuned on text generated by OpenAI’s GPT-3.5. Using this technique of knowledge distillation, LLMs can be made much smaller without sacrificing too much performance. Alpaca-LoRA¹ employs an approach known as Low-Rank Adaptation [ 23 ], which keeps the pre-trained model weights constant and introduces trainable rank decomposition matrices at each layer of the Transformer architecture. This significantly reduces the number of trainable parameters for downstream tasks. Pythia is a library for Transformers, providing various pre-trained models, which are also GPT based. OpenAssistant [ 24 ] fine-tuned Pythia and LLaMA models on human-labelled datasets to boost the models’ performance and create an open-source competitor to ChatGPT.
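To make the parameter saving of Low-Rank Adaptation mentioned above concrete, the short Python sketch below compares the number of trainable parameters of a full weight matrix with that of its two low-rank factors. The layer size (4096 × 4096) and the rank (8) are hypothetical values chosen for the arithmetic, not the configuration of any model discussed here.

```python
def full_params(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of the two low-rank factors.

    The frozen weight W is left untouched; only B (d_out x rank) and
    A (rank x d_in) are trained, and their product is added to W.
    """
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 8
full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
ratio = lora / full
```

Under these assumed sizes, the low-rank factors hold 65,536 trainable parameters against 16,777,216 for the full matrix, which is under 0.4% of the original count.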

¹ https://github.com/tloen/alpaca-lora

3. Methodology

The methodology section commences with a comprehensive overview of the corpora and the dataset used in each task. Subsequently, it details the methods employed for each task we participated in.

3.1. Corpora and Dataset

For Task B, we were provided with a dataset containing data from the first ten editions of the challenge. The dataset included 4719 questions, categorized as 1417 ‘factoid’, 1271 ‘yesno’, 1130 ‘summaries’, and 901 ‘lists’. Each question was accompanied by its relevant documents, snippets, concepts, RDF triples, and exact and ideal answers. To construct our corpus, we utilized the PubMed annual baseline document collections spanning from 2013 to 2023. This corpus consisted of the abstracts and titles of all documents. The most recent PubMed baseline collection (2023) contains approximately 35 million documents. However, we encountered a challenge due to the dynamic nature of the documents. Each year, documents are updated or removed, which means that the relevant documents for a question in the first edition may no longer be present in the document collection for the current edition. This posed a problem when relying solely on the latest baseline collection to extract the title and abstract for accurate querying. To address this issue, we augmented each question with the year it appeared in, enabling us to query the relevant documents more precisely.

Additionally, we encountered some documents that were missing titles, abstracts, or both. This could be due to licensing or linguistic issues. We addressed this by removing these incomplete documents from the collection. Afterwards, we created sparse Anserini [ 25 ] indexes for each year. Having yearly indexes proved advantageous as it allowed us to search for relevant documents specific to the year in which a question appeared. This approach enhanced the accuracy of retrieving pertinent information for each question.

Regarding the question dataset, there were cases where questions were repeated or were very similar while having a different set of relevant documents. For this reason, we decided to merge similar questions, combining their sets of relevant articles to enrich the training data. To accomplish this, we leveraged the pre-trained SimCSE [ 26 ] model to compute the similarity between questions. Questions with a similarity score above 99% were automatically merged, while questions with a similarity score between 90% and 99% were manually reviewed. An additional step was to remove the questions from before BioASQ 4. During those years of the challenge, the systems were able to use the full-text article from PubMed Central (PMC) to make judgments. This leads to situations where the model does not have the necessary content to make a correct prediction for these question-document pairs. At the end of this process, the number of resulting questions was 3465 (-30%), totalling 25781 positive question-document pairs.

In order to build a training dataset for neural relevance models, we also need to gather negative question-document pairs, such that the model can learn how to correctly score relevant and irrelevant documents. To accomplish this, we performed random sampling over the list of documents returned by BM25 that were not positives for a given question. This should result in a list of strong negative documents.

3.1.1. Synthetic Question Generation

Data quality and quantity are crucial for developing strong, effective models in deep learning for information retrieval and relevance determination. In the previous section, we described our pre-processing steps to increase the quality of the gold standard data. However, we are still lacking in terms of data quantity. We propose generating questions with transformer-based language models to create a synthetic dataset that can be used to pre-train the relevance models so that they first learn basic retrieval patterns.

To synthetically generate a question for a given article, we fed the language model an engineered prompt that tries to condition it to generate a question based on the information contained in the article. More formally, we empirically built the prompt, $p = \{p_1, ..., p_n\}$, such that a language model would maximize the probability of a question, $q = \{q_1, ..., q_m\}$, being sampled according to Equation 1.

$$q \sim \prod_{i=1}^{m} P(q_i \mid p_1, ..., p_n, q_1, ..., q_{i-1}) \qquad (1)$$

In this work, we mainly used zero-shot question generation since we did not explore training the language models to generate questions based on the BioASQ data. To further guide the language model into generating useful questions, we also included a question starting word as part of the prompt, such that the model is forced to pick the following words conditioned on that starting word. Some examples of words that start a question are {What, Which, Is, List, Are, Does}². Prompt 1 shows the prompt that we adopted for generating a question in a zero-shot fashion with the OA-Pythia 12B model.

    <|prompter|>Given the following context "{article}", generate a question that can be answered by the information provided in the context: <|endoftext|><|assistant|>What

Prompt 1: Example of the last prompt used to generate synthetic questions with the OA-Pythia model.
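The construction of such a prompt can be sketched in Python as follows. The template text is taken from Prompt 1, whereas the weights attached to the starting words are hypothetical placeholders: the actual distribution of starting words in the BioASQ dataset is not reproduced here. The helper name `build_question_prompt` is likewise an assumption made for the example.

```python
import random

# Template from Prompt 1; the model continues the text after the start word.
PROMPT_TEMPLATE = (
    '<|prompter|>Given the following context "{article}", generate a question '
    "that can be answered by the information provided in the context: "
    "<|endoftext|><|assistant|>{start_word}"
)

# Hypothetical weights, for illustration only (not the real distribution).
START_WORD_WEIGHTS = {"What": 4, "Which": 3, "Is": 3, "List": 1, "Are": 1, "Does": 1}


def build_question_prompt(article, rng):
    """Fill the template with an article and a sampled question start word."""
    words = list(START_WORD_WEIGHTS)
    weights = [START_WORD_WEIGHTS[word] for word in words]
    start_word = rng.choices(words, weights=weights, k=1)[0]
    return PROMPT_TEMPLATE.format(article=article, start_word=start_word)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prompt = build_question_prompt("Aspirin reduces fever.", rng)
```

Because the prompt ends with the starting word, the language model is constrained to continue a question of the corresponding kind.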

Regarding the language models, we experimented with small models such as GPT-Neo-125M [ 22 ] and with larger ones such as OA-Pythia-12B [ 20, 24 ]. The synthetic dataset contained 79855 questions that were generated from 15971 randomly sampled articles.

² These words follow the distribution of the starting words that appear in the BioASQ dataset.

3.2. Phase A

Our approach for phase A of the challenge involved the development of a two-stage retrieval pipeline designed to handle the large volume of biomedical literature efficiently. Figure 1 presents an overview of our two-stage retrieval pipeline.

At first, we utilized a sparse retrieval method. To accomplish this, we constructed an inverted index, a commonly used data structure in information retrieval that maps terms to the documents that contain them, using Anserini, a powerful retrieval toolkit built on Lucene [ 27 ]. For compatibility with our Python-based pipeline, we used Pyserini, Anserini’s Python wrapper [ 28 ].

For document retrieval, we adopted the BM25 ranking function, which is widely recognized for its effectiveness [ 29 ]. We selected the top 100 documents based on BM25 scores as the initial retrieval result and occasionally extended them to the top 1000 for broader coverage. Figure 3 illustrates that extending from the top 100 to the top 1000 documents increases recall by 20 percentage points (from 71% to 91%). This extension provides a higher chance of retrieving more relevant documents. However, it comes with a trade-off in speed, as the neural retrieval system needs to process ten times more documents. It is worth noting that if the top 100 documents already contain a sufficient number of positive documents, using the top 1000 may not yield significant gains in metrics. This observation is addressed later in the discussion section.
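The recall figures quoted above measure which fraction of the known relevant documents appears in the first-stage candidate set. A minimal Python sketch of this computation is given below; the document identifiers are hypothetical and the function name `recall_at_k` is chosen for the example.

```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the relevant documents found in the top-k ranked results."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & relevant) / len(relevant)


# Hypothetical first-stage ranking and gold standard for one question.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}

recall_2 = recall_at_k(ranked, relevant, k=2)
recall_3 = recall_at_k(ranked, relevant, k=3)
recall_5 = recall_at_k(ranked, relevant, k=5)
```

Recall can only grow as the cutoff increases, which is why a deeper candidate list gives the re-ranker more relevant documents to work with, at the cost of more documents to score.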

[Figure 3 (plot): recall as a function of the size of the retrieved document set.]

The parameters for BM25, specifically $b$ and $k_1$, were selected through a preliminary hyperparameter tuning process. Figure 4 shows a summary of all the runs and their respective parameters. Based on this, we adopted the parameters $k_1 = 0.5$ and $b = 0.3$.

In the second stage, we used re-ranking models, including state-of-the-art transformer-based models such as PubMedBERT [ 6 ] and monoT5 [ 5 ] (both base and large variants). We also considered the BioGPT [ 30 ] and Pegasus [ 31 ] models, but due to their higher computational cost, they were discarded. These models were trained using both pointwise and pairwise approaches to evaluate their effectiveness in differing scenarios. To compensate for the limited availability of training data, we also experimented with including synthetic data in our training regimen as a pretraining mechanism.

Finally, to consolidate the output from several models, we used reciprocal rank fusion (RRF) [ 32 ]. This approach acts as an ensemble technique to improve the overall ranking order of the relevant documents by considering the judgment of multiple models.

3.2.1. Submissions

The runs submitted to the phase A challenge were ensembles of various trained models with different checkpoints. The various systems submitted are briefly described in Table 1.

• System-0: This system contained 3 PubMedBERT models that re-ranked 1000 documents fetched from BM25.
• System-1: This system contained 5 PubMedBERT models that re-ranked 100 documents fetched from BM25.
• System-2: For the first batch, the system contained 2 T5-base and 2 T5-Large models that re-ranked 100 documents from BM25. For the rest of the batches, the system used an ensemble of models trained on synthetic data and then fine-tuned on the challenge data. The models ensembled were 7 PubMedBERT models, 5 of which re-ranked 1000 documents, and the remaining 2 re-ranked 200 documents.
• System-3: In the first batch, an ensemble of 2 T5-base models and 2 PubMedBERT models was used to re-rank 100 documents. The following 2 batches investigated pairwise training with synthetic data, where 7 PubMedBERT models were trained using a pairwise loss function. Among them, 4 models re-ranked 100 documents, and 3 re-ranked 500 documents. In the final batch, some models were removed and replaced with models from the first system.
• System-4: In the first batch, the system contained 2 T5-Large models and 2 T5-Base models. The Large models re-ranked 1000 documents, and the Base models re-ranked 100 documents. In the remaining submissions, we ensembled most of our trained models, reaching a total of 25 models. However, it should be noted that only 24 models were used in the last batch.

3.3. Phase B

To provide a natural language answer to a question, we adopted an exploratory approach, testing various prompts and models to gauge their effectiveness in generating precise and meaningful answers. Recognizing the varying complexities inherent to different question types, we also experimented with per-question type prompting. This approach considers the nature of the question (be it factoid, list, or summary) and tailors the model’s prompt accordingly, enabling more accurate and contextually relevant responses (see Prompt 2 for reference).

    Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

    ### Instruction:
    {instruction}

    ### Input:
    ABSTRACT: {text}
    QUESTION: {question}

    ### Response:

Prompt 2: Example of a zero-shot prompt for generating answers with the Alpaca-LoRA model.

The text within brackets corresponds to placeholders for the instruction, question and article text. The default instruction was “Given the ABSTRACT, answer the QUESTION”. For yesno questions we used “Given the input ABSTRACT produce a yes or no answer to QUESTION”, while for summary questions we used “Given the input ABSTRACT produce a short and concise answer to QUESTION”.
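The per-question-type prompting described above amounts to selecting an instruction and filling the template of Prompt 2. The Python sketch below shows one way to do this; the instruction strings are those quoted in the text, while the exact line breaks of the template, the dictionary keys and the function name `build_answer_prompt` are assumptions made for the example.

```python
# Template following Prompt 2; the exact whitespace is an assumption.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\nABSTRACT: {text}\nQUESTION: {question}\n\n"
    "### Response:\n"
)

DEFAULT_INSTRUCTION = "Given the ABSTRACT, answer the QUESTION"

# Instructions that override the default for specific question types.
TYPE_INSTRUCTIONS = {
    "yesno": "Given the input ABSTRACT produce a yes or no answer to QUESTION",
    "summary": "Given the input ABSTRACT produce a short and concise answer to QUESTION",
}


def build_answer_prompt(question, text, question_type):
    """Fill the template, choosing the instruction by question type."""
    instruction = TYPE_INSTRUCTIONS.get(question_type, DEFAULT_INSTRUCTION)
    return ALPACA_TEMPLATE.format(instruction=instruction, text=text, question=question)


# Hypothetical question and abstract, for illustration only.
yesno_prompt = build_answer_prompt(
    "Does aspirin reduce fever?", "Aspirin is an antipyretic drug.", "yesno"
)
factoid_prompt = build_answer_prompt(
    "Which drug reduces fever?", "Aspirin is an antipyretic drug.", "factoid"
)
```

Question types without a dedicated entry, such as factoid and list in this sketch, fall back to the default instruction.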

Context selection should play a large part in the quality of the text generation. We tested this using our top retrieved article from phase A, the top gold standard article, or a combination of both as context for the model. The latter was accomplished by selecting the gold standard article that was ranked highest according to our model.

A key focus of our experimentation was the application of various advanced language models, such as Alpaca-LoRA (13 billion parameters), OA-Pythia (12 billion) and OA-LLaMA (30 billion). Furthermore, we experimented with different answer-generation strategies, including random sampling, beam search and contrastive search. In random sampling, the next token is drawn at random following the probability distribution of the model. In beam search, multiple possible continuations are explored at each step, up to a predefined beam width, aiming to find the most probable sequence of words. In contrastive search, the next token is chosen among the most probable candidates by balancing the model’s confidence against a penalty on similarity to the preceding context, which discourages repetitive text. We also extensively tested different hyperparameters for model generation, including temperature and the maximum token length. This experimentation allowed us to tune our models’ generation, leading to more precise and informative answers.

3.3.1. Submissions

For Phase B, our submissions consisted of various instruction-tuned transformer-based models, each described concisely in Table 2. The “Document Source” column specifies the origin of the article used as context for answer generation. Specifically, “System-0” and “System-4” correspond to the highest scoring documents output by the respective phase A system. On the other hand, “Gold” indicates that the document was obtained from the provided gold standard.

More precisely, due to time constraints, for the first batch we submitted only a single model, using the top-ranked document from our best-performing phase A model. This was seen as a naive approach, which is why in further batches we used both the documents from our models and the gold standard documents to answer a question. In the second batch we used the same Alpaca-LoRA model, with the addition of the 30 billion parameter version, using as context the documents from the best-performing model of the batch and also the gold standard documents. For the third batch, we were unable to submit 5 submissions due to technical problems; however, in this batch we changed the model to OpenAssistant’s Pythia 12 billion parameter model. In the final batch, we additionally tested the OpenAssistant LLaMA model with 30 billion parameters. Regarding the generation strategies, we adopted contrastive search for Alpaca-LoRA and random sampling with high confidence for the OpenAssistant variants.

4. Results

This section starts by addressing our validation results measured over a subset of the training data. Then we show the official preliminary results of the BioASQ challenge for phases A and B. Note that the preliminary results are those available at the time of writing and are subject to change after the reevaluation period. To see the official results, use the BioASQ 11B official leaderboard³.

³ Phase A: http://participants-area.bioasq.org/results/11b/phaseA/, Phase B: http://participants-area.bioasq.org/results/11b/phaseB/

[Table 3: model types (PubMedBERT, monoT5-base, monoT5-large), all re-ranking the BM25 results.]

4.1. Validation results

The validation of our models was conducted to assess their performance and gain valuable insights into their configuration. In this section, we summarize the validation results obtained over a subset of the training data. More precisely, we performed a stratified train/test split of 95/5 of the dataset, which corresponds to 3292 questions for training and 173 for validation. Taking into consideration that an official evaluation batch only contains 90 questions, we believe that our split was representative.

Table 3 summarizes the best validation results of various neural relevance models trained on different subsets of data and also the BM25 baseline. The models were evaluated based on their Mean Average Precision at 10 (MAP@10) score, which measures the average precision of the top 10 retrieved documents for each query. Each neural model was trained using different combinations of training data, including synthetic and gold standard datasets.
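For reference, MAP@10 can be computed as in the Python sketch below. Note that the normalisation of the average precision, here by the smaller of the cutoff and the number of relevant documents, is an assumption of this sketch; the official BioASQ evaluation may normalise differently. The toy rankings are hypothetical.

```python
def average_precision_at_k(ranked, relevant, k=10):
    """Average precision over the top-k results of one query.

    Assumes `ranked` has no duplicates. Normalises by min(k, |relevant|),
    which is one common convention among several.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def map_at_k(runs, k=10):
    """Mean of the per-query average precision over (ranked, relevant) pairs."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Two hypothetical queries with their rankings and gold standard documents.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = (1/2) / 1
]
map_10 = map_at_k(runs, k=10)
```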

Overall, when training with the gold standard data, all the reranking methods are able to improve upon the baseline, reinforcing the idea that it is beneficial to adopt a reranking method as a second-stage mechanism of a retrieval pipeline. Regarding the architectures, the PubMedBERT and monoT5-large architectures achieved comparable performances, whilst the monoT5-base architecture achieved considerably poorer results. This disparity may be attributed to the fact that monoT5 is a sequence-to-sequence model that directly learns the retrieval task using natural language, placing greater emphasis on the quality of the underlying language model and, therefore, on its size.

Notably, the best configuration we obtained involved training the PubMedBERT model with synthetically generated data and subsequently fine-tuning it with the gold data. This outcome highlights the beneficial impact of incorporating synthetic data.

Furthermore, an unexpected result emerged when comparing the performance of models that were only trained with synthetic data against the BM25 baseline. It was surprising to observe that using only synthetic data yielded improvements over the performance of BM25. This suggests that it is indeed possible to train models without relying on gold data and still achieve superior performance compared to traditional baselines such as BM25. This finding opens up new possibilities for model training and highlights the potential of synthetic data as a valuable resource for retrieval tasks where no labelled data is available.

4.2. Phase A

The preliminary results of our submissions are displayed in Table 4, showcasing the rankings based on Mean Average Precision at 10 (MAP@10). Additionally, we provide the results regarding F1-score at 10, offering insights into the trade-off between precision and recall across the systems. The Top Competitor represents the most successful system among all competitor systems. Overall, we achieved highly competitive results, obtaining the best-performing system in the first and second batches in MAP@10 and the best-performing system in all the batches in the F1-score. Significantly, the systems that attained these high F1-scores were relevance models, designed to discard documents if the likelihood of their relevance fell below 1%. Consequently, for questions with fewer than 10 positive documents, these systems were capable of outputting fewer than 10 documents, thus increasing precision compared to a ranking model that consistently outputs 10 documents regardless of their scores.
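The effect of discarding low-probability documents on the F1-score can be illustrated with the Python sketch below. The 1% threshold comes from the text, whereas the relevance probabilities, document identifiers and function names are hypothetical and serve only to show the mechanism.

```python
def filter_by_relevance(scored_docs, threshold=0.01, max_docs=10):
    """Keep at most `max_docs` documents whose relevance probability
    is not below `threshold`, in decreasing order of probability."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, prob in ranked[:max_docs] if prob >= threshold]


def f1_score(returned, relevant):
    """Harmonic mean of precision and recall for a returned document set."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(returned)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model outputs for a question with only two relevant documents.
scored = [("d1", 0.97), ("d2", 0.85), ("d3", 0.004), ("d4", 0.002)]
relevant = {"d1", "d2"}

filtered = filter_by_relevance(scored)        # drops d3 and d4
unfiltered = [doc_id for doc_id, _ in scored]  # always returns everything

f1_filtered = f1_score(filtered, relevant)
f1_unfiltered = f1_score(unfiltered, relevant)
```

In this toy case both outputs reach full recall, but the filtered list has no false positives, so its precision and F1-score are higher.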

Comparing the systems, the first two, System-0 and System-1, were used to study the difference between re-ranking 1,000 and 100 documents. An analysis of these models’ results across various batches revealed that the performance was not significantly affected by the increase in re-ranked documents. This observation is revisited in the subsequent discussion section.

Upon evaluating the remaining systems for the first batch, it was discerned that the utilization of T5 models did not significantly enhance performance compared to the BERT models. This observation carries substantial importance, especially given that the inference time for T5 models exceeded that of the BERT-based models. Consequently, the decision was taken to cease the deployment of T5 models in subsequent submissions, favouring instead the more efficient BERT models, which delivered adequate performance. Furthermore, System-4 for the initial batch demonstrated an unexpectedly lower Mean Average Precision (MAP) compared to the outcomes of other ensemble methods. This indicates that the specific configuration or ensemble of models in System-4 did not yield the anticipated results.

Furthermore, upon comparing System-2 and System-3, the distinctive variance can be traced back to the training technique deployed. It was deduced that pairwise training slightly underperformed compared to pointwise training methods. As a consequence, only pointwise training was used in the final batch.

Turning to the final system, System-4, it was observed that ensembling more models consistently outperformed the other systems. Again, this is an anticipated result corroborating existing literature [33].

4.3. Phase B

The preliminary, automatically computed results for phase B are displayed in Table 5. Before analysing the results, it is important to note that the official evaluation of the ideal answers in the competition is manual, rather than based on these automatic metrics. Overall, our systems showed a reasonable performance on the automatic metrics, at best placing 9th, with the remainder of the submissions mostly below the median position. The automatic metrics used in the competition are ROUGE-2 and ROUGE-SU4; in the results presented, we show ROUGE-2 (F1). Given that our approach to generating the text was unsupervised, this is not surprising, as our system is not guided to generate the expected BioASQ answers. Nevertheless, the answers can be correct and still be missed by the automatic metrics. In Appendix A we showcase some examples of answers that were generated by the OA-LLaMA-30B model and the OA-Pythia-12B model.

5. Discussion

Throughout phase A, we observed that our reranking methods consistently improved on the baseline ranking order, which is known to be a challenging task, as mentioned in [9]. To make these improvements more tangible, Figure 6 presents the ratio of improvement achieved by our reranking models over the BM25 baseline. Across all batches, our reranking models achieved an average improvement of 30%, in some cases nearing 40%. We attribute these gains to two primary factors. Firstly, the quality of our training data played a crucial role, as we focused on meticulous cleaning of the gold standard data prior to training our models. Additionally, the availability of more advanced training algorithms enabled efficient fine-tuning of entire transformer-based models, further contributing to the models' performance.

[Figure 6: improvement ratio over the baseline (MAP@10) for System 0 to System 4 in Batches 1 to 4.]
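The improvement ratio reported for Figure 6 can be read as a relative gain over the BM25 baseline. The sketch below assumes the ratio is computed as (reranked - baseline) / baseline on MAP@10; the function name and the numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def improvement_ratio(reranked_map, baseline_map):
    """Relative gain of a reranked run over the baseline (0.30 means +30%)."""
    if baseline_map <= 0:
        raise ValueError("baseline MAP must be positive")
    return (reranked_map - baseline_map) / baseline_map


# Hypothetical MAP@10 values, not taken from our results.
baseline_map10 = 0.30
reranked_map10 = 0.39
ratio = improvement_ratio(reranked_map10, baseline_map10)
print(f"improvement over baseline: {ratio:.0%}")
```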

Next, we delve into a detailed discussion of various factors that impact the performance of our systems, namely model architecture, loss function, the number of reranked documents, and the utilization of synthetic data during pretraining. To facilitate this analysis, we present parallel plots in Figures 7 and 8, showcasing these variables for the models used in the second and fourth batches, respectively. Although we focus on these two batches for clarity, it is worth noting that the first and third batches follow similar patterns.

Upon examining both figures, it becomes evident that the preferred architecture and loss function are PubMedBERT and pointwise, respectively, as these models achieved the highest MAP@10 scores in the plots. Furthermore, increasing the number of documents used for reranking does not appear to improve the metrics. This may be because the BioASQ team evaluates system performance on the top 10 documents only. Therefore, when there are already enough positive documents among the top 100, reranking a larger number of documents may not result in noticeable improvements.
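The effect of a top-10 cutoff can be illustrated with a small average-precision computation. This is a sketch under the assumption that the per-question score is normalised by min(k, number of relevant documents); the official BioASQ definition may differ in that detail. The document ids are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision computed over the first k ranked documents only."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalisation: min(k, number of relevant documents).
    return precision_sum / min(k, len(relevant))


relevant = ["r1", "r2", "r3"]
# Two hypothetical rankings that differ only below position 10.
ranking_a = ["r1", "n1", "r2", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9", "r3"]
ranking_b = ["r1", "n1", "r2", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "r3", "n9"]
print(average_precision_at_k(ranking_a, relevant))
print(average_precision_at_k(ranking_b, relevant))
```

Moving a relevant document from rank 12 to rank 11 leaves the score unchanged, so a deeper reranking pool only helps when it promotes additional relevant documents into the top ten.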

Finally, the impact of synthetic data yields contradictory results. In the case of the second batch (Figure 7), incorporating synthetic data did not contribute to an overall performance improvement. However, for the fourth batch, it did exhibit a positive effect. We speculate that this discrepancy may be attributed to the quantity and coverage of the synthetic questions generated. Specifically, for the fourth batch, the test set questions may have been closer to those synthetically generated, particularly in terms of the documents used for their generation. Further investigation is needed to validate this hypothesis.

[Parallel-plot figure: axes recovered from the extraction are synthetic data (True/False), model type (monoT5-large, monoT5-base, PubMedBERT), number of reranked documents (100, 250, 500, 750), and loss (pairwise or pointwise).]

The generation phase (Phase B) of our system presented several insightful findings. Notably, we observed a positive correlation between the size of the language model and the quality of generated answers, which aligns with previous findings that larger models generally tend to perform better [34, 35].

Additionally, we found that small modifications to the prompt significantly impacted the system’s output, suggesting that the models may struggle with generalization. This effect was more pronounced in smaller models, indicating that fine-tuning may be necessary to achieve optimal results [36]. In contrast, for larger models, the quality of generation was less affected by prompt variation, showcasing their robustness.

Overall, the text generation quality was satisfactory, demonstrating coherence and relevance to the biomedical questions. Using different prompts for the various question types particularly enhanced the performance of smaller models, aligning them more closely with the inherent intricacies of each question category.
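As an illustration of question-type-specific prompting, the sketch below selects a template according to the BioASQ question type (yes/no, factoid, list, summary) and fills it with the question and one retrieved abstract. The template wording, the dictionary keys and the function name are hypothetical and are not the prompts used in our submissions.

```python
# Hypothetical templates, one per BioASQ question type.
PROMPT_TEMPLATES = {
    "yesno": (
        "Context: {context}\nQuestion: {question}\n"
        "Answer yes or no, then justify briefly.\nAnswer:"
    ),
    "factoid": (
        "Context: {context}\nQuestion: {question}\n"
        "Answer with the specific entity and one supporting sentence.\nAnswer:"
    ),
    "list": (
        "Context: {context}\nQuestion: {question}\n"
        "Answer with the items requested, in one or two sentences.\nAnswer:"
    ),
    "summary": (
        "Context: {context}\nQuestion: {question}\n"
        "Answer in a short paragraph based only on the context.\nAnswer:"
    ),
}


def build_prompt(question, context, question_type):
    """Fill the template for the given question type (falls back to summary)."""
    template = PROMPT_TEMPLATES.get(question_type, PROMPT_TEMPLATES["summary"])
    return template.format(context=context, question=question)


prompt = build_prompt(
    question="Is SARS-CoV-2 transmitted through breast milk?",
    context="(abstract of a retrieved article goes here)",
    question_type="yesno",
)
print(prompt)
```

Keeping the templates in one table makes it easy to vary a single prompt and observe how sensitive a given model is to the change.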

We also explored ensembling multiple contexts to improve answer diversity and depth. Unfortunately, our attempts were not fruitful, suggesting that this method might require further refinement to be effective in this specific task.

Finally, we hypothesize that with some pre-training or domain-specific training, the models might perform even better. Such training could enhance their ability to generate precise and contextually accurate answers for biomedical questions, further increasing their utility in real-world applications [37].

6. Conclusion

In this paper, we detailed our participation in phases A and B of Task B in the eleventh edition of the BioASQ challenge. For phase A, we adopted a two-stage retrieval pipeline comprising Anserini BM25 as the initial stage, followed by reranker models based on the PubMedBERT and monoT5 transformer models. To train the reranker models effectively, we enhanced the quality of the gold standard data through careful cleaning and also explored synthetic data augmentation through question generation. With these methods, we achieved significant improvements over the baseline ranking order. Our systems placed first in various batches of the competition.

For phase B, our approach leveraged instruction-tuned transformer-based models to generate answers conditioned on the articles retrieved during phase A in a zero-shot setting. We observed a positive correlation between the size of the language model and the quality of the generated answers. Smaller models were more sensitive to prompt variations, indicating the need for fine-tuning to enhance their performance. Larger models, on the other hand, exhibited greater robustness and generated coherent and relevant answers. Using different prompts for the various question types improved the performance of smaller models, aligning them more closely with the specific intricacies of each question category. Overall, our performance in phase B, according to the automatic metrics, was mediocre. However, we believe that further manual analysis is required for a fair evaluation, given the unsupervised nature of our generation method.

7. Future Work

In terms of the direction for future work, several promising avenues appear worthy of exploration, particularly for Phase B of our system.

First, while our initial attempts to join multiple contexts (ensembling) did not yield the anticipated results, we believe this approach still holds considerable potential. Therefore, refining our ensembling techniques to effectively integrate different contexts into the question-answering process will be an area of interest. This could potentially enhance both the diversity and depth of our generated answers.

Second, the incorporation of snippet extraction as an intermediary step in our approach might serve to enhance the precision of our answer generation. Extracting relevant snippets from the retrieved documents could refine the context that is fed into our models, potentially leading to more accurate and relevant answers. Several recent works have reported success using such techniques [38].
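A very simple form of such an intermediary step could look as follows. This is only a sketch of the idea, assuming sentences are scored by word overlap with the question; it is not an implemented component of our system, a practical version would likely use a trained model, and the toy abstract is invented for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_snippets(question, abstract, top_n=2):
    """Return up to top_n sentences sharing the most words with the question,
    kept in their original order; sentences with no overlap are dropped."""
    q_terms = tokenize(question)
    sentences = split_sentences(abstract)
    scored = [(len(q_terms & tokenize(s)), i, s) for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda t: (-t[0], t[1]))[:top_n]
    return [s for score, _, s in sorted(best, key=lambda t: t[1]) if score > 0]


# Toy abstract, written for this example only.
abstract = (
    "Optogenetics controls neurons with light. "
    "Luminopsins are fusion proteins of a luciferase and an opsin. "
    "They were tested in mice."
)
snippets = extract_snippets("What are luminopsins?", abstract, top_n=1)
print(snippets)
```

Only the selected sentences would then be passed to the language model as context, instead of the full abstract.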

Lastly, fine-tuning the models specifically for the ideal answers in Phase B of Task B could further boost performance. As we observed that prompt changes significantly impacted the system’s output, especially for smaller models, task-specific fine-tuning might increase the models’ robustness against these changes and enhance their overall performance. In fact, recent studies have shown that fine-tuning large-scale pre-trained models on downstream tasks can lead to substantial improvements in task performance [35, 36].

Acknowledgments

This work was partially supported by national funds through the Foundation for Science and Technology (FCT) in the context of the project UIDB/00127/2020. Tiago Almeida is funded by FCT under the grant 2020.05784.BD. Jorge Miguel Silva has received funding from the EC under grant agreement 101081813, Genomic Data Infrastructure.

[33] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 758–759. URL: https://doi.org/10.1145/1571941.1572114. doi:10.1145/1571941.1572114.

[34] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.

[35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.

[36] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019).

[37] U. Khandelwal, K. Clark, D. Jurafsky, L. Kaiser, Sample efficient text summarization using a single pre-trained transformer, arXiv preprint arXiv:1905.08836 (2019).

[38] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, Q. Liu, ERNIE: Enhanced language representation with informative entities, arXiv preprint arXiv:1905.07129 (2019).

A. Examples of answer generation

Question: Which are the types of Charcot-Marie-Tooth Disease (CMT)?

OA-LLaMA-30B: There are four main types of Charcot-Marie-Tooth Disease (CMT): CMT1, CMT2, CMT3, and CMT4.

Question: What are Luminopsins?

OA-LLaMA-30B: Luminopsins are chimeric proteins consisting of a luciferase fused to an opsin that allow for control of neuronal activity through optogenetics.

Pythia-12B: Luminopsins (abbreviated as LMOs or Lumisomes) are genetically encoded photoswitches that respond to blue light by changing their absorption spectrum from 500 nm to 600-700nm. [1][2] They consist of a fusion protein between luciferases such as firefly Lucina pellucida [b] (...)

Answer generated 2: Question id 6419cd02690f196b5100003a

Question: Is SARS-CoV-2 transmitted through breast milk?

OA-LLaMA-30B: No, according to this study, there is no evidence that breast milk contains infectious SARS-COVID-2.

Pythia-12B: The patient’s symptoms are consistent with those seen in individuals with Proprotein Convertase Subtilisin/Kexin Type 1 (PCSK1) mutations, which result in decreased activity or lack of expression of the protein. Specifically, these mutations lead to a decrease in proteolytic processing of peptides involved in nutrient transport across epithelia such as jejunum and ileum, resulting in malabsorption of carbohydrates, fats, and proteins. This may explain why the patient experienced episodes of fasting hypokalemia secondary to carbohydrate and fat maldigestion while also experiencing diarrhea due to poor digestion and malassimilation of macronutrients. It should be noted that other factors such as imbalances in electrolytes and bicarbonate production could contribute to the patient’s phenotype as well. Additionally, given the rarity of the condition, it is possible that there are undiscovered modifier genes that influence penetrance and expressivity of PCSK1 mutations, further complicating diagnosis and management of affected individuals (...)

[1] Klerings, A. S. Weinhandl, K. J. Thaler, Information overload in healthcare: too much of a good thing?, Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen 109 (2015) 285–290.

[2] Tsatsaronis, Schroeder, G. Paliouras, Almirantis, Androutsopoulos, E. Gaussier, Gallinari, Artieres, M. R. Alvers, Zschunke, et al., BioASQ: A challenge on large-scale biomedical semantic indexing and question answering, in: AAAI fall symposium: Information retrieval and knowledge discovery in biomedical text, Arlington, VA: Citeseer, 2012.

[3] Nentidis, Katsimpras, Krithara, Lima-López, Farré-Maduell, Gasco, Krallinger, G. Paliouras, Overview of BioASQ 2023: The eleventh BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), 2023.

[4] Robertson, Zaragoza, The Probabilistic Relevance Framework: BM25 and Beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389. URL: https://www.nowpublishers.com/article/Details/INR-019. doi:10.1561/1500000019.

[5] Nogueira, Jiang, Pradeep, Lin, Document Ranking with a Pretrained Sequence-to-Sequence Model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.

[6] Gu, Tinn, H. Cheng, M. Lucas, Usuyama, Liu, Naumann, Gao, Poon, Domain-specific language model pretraining for biomedical natural language processing, ACM Transactions on Computing for Healthcare (HEALTH) 3 (2021) 1–23.

[7] Nentidis, Katsimpras, Vandorou, Krithara, G. Paliouras, Overview of BioASQ tasks 10a, 10b and Synergy10 in CLEF2022, in: G. Faggioli, Ferro, Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 171–178. URL: https://ceur-ws.org/Vol-3180/paper-10.pdf.

[8] Almeida, Pinho, Pereira, Matos, Deep Learning solutions based on fixed contextualized embeddings from PubMedBERT on BioASQ 10b and traditional IR in Synergy, in: G. Faggioli, Ferro, Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 204–221. URL: https://ceur-ws.org/Vol-3180/paper-12.pdf.

[9] Almeida, Matos, Universal passage weighting mecanism (UPWM) in BioASQ 9b, in: G. Faggioli, Ferro, Joly, Maistro, Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st to 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 196–212. URL: https://ceur-ws.org/Vol-2936/paper-13.pdf.

[10] Dai, Callan, Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval, 2019. URL: http://arxiv.org/abs/1910.10687. doi:10.48550/arXiv.1910.10687, arXiv:1910.10687 [cs].

[11] Dai, Callan, Context-Aware Document Term Weighting for Ad-Hoc Search, in: Proceedings of The Web Conference 2020, WWW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1897–1907. URL: https://dl.acm.org/doi/10.1145/3366423.3380258. doi:10.1145/3366423.3380258.

[12] Gao, Callan, Condenser: a pre-training architecture for dense retrieval, arXiv preprint arXiv:2104.08253 (2021).

[13] Karpukhin, Oguz, Min, Wu, Edunov, Chen, W. Yih, Dense passage retrieval for open-domain question answering, CoRR abs/2004.04906 (2020). URL: https://arxiv.org/abs/2004.04906. arXiv:2004.04906.

[14] Xiong, Li, Tang, J. Liu, P. N. Bennett, Ahmed, Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, CoRR abs/2007.00808 (2020). URL: https://arxiv.org/abs/2007.00808. arXiv:2007.00808.

[15] Johnson, M. Douze, Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.

[16] Zhan, Mao, Liu, Guo, Zhang, S. Ma, Optimizing dense retrieval model training with hard negatives, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1503–1512.

[17] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.

[18] Touvron, Lavril, Izacard, Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).

[19] Taori, I. Gulrajani, Zhang, Dubois, Li, Guestrin, Liang, T. B. Hashimoto, Stanford alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.

[20] Biderman, Schoelkopf, Anthony, Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, O. van der Wal, Pythia: A suite for analyzing large language models across training and scaling, 2023. arXiv:2304.01373.

[21] Chowdhery, Narang, Devlin, Bosma, Mishra, Roberts, Barham, H. W. Chung, Sutton, Gehrmann, Schuh, Shi, Tsvyashchenko, Maynez, Rao, Barnes, Tay, Shazeer, Prabhakaran, Reif, Du, Hutchinson, Pope, Bradbury, Austin, Isard, Gur-Ari, Yin, Duke, Levskaya, Ghemawat, Dev, Michalewski, Garcia, Misra, Robinson, Fedus, Zhou, Ippolito, Luan, Lim, Zoph, Spiridonov, Sepassi, Dohan, Agrawal, Omernick, A. M. Dai, T. S. Pillai, Pellat, Lewkowycz, E. Moreira, Child, Polozov, Lee, Zhou, Wang, Saeta, Diaz, Firat, Catasta, Wei, Meier-Hellstern, Eck, Dean, Petrov, N. Fiedel, PaLM: Scaling language modeling with pathways, 2022. arXiv:2204.02311.

[22] Black, Gao, Wang, Leahy, S. Biderman, GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow, 2021. URL: https://doi.org/10.5281/zenodo.5297715. doi:10.5281/zenodo.5297715.

[23] E. J. Hu, Y. Shen, P. Wallis, Allen-Zhu, Li, Wang, Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.

[24] Köpf, Kilcher, D. von Rütte, Anagnostidis, Z.-R. Tam, Stevens, Barhoum, N. M. Duc, Stanley, Nagyfi, ES, Suri, Glushkov, Dantuluri, Maguire, Schuhmann, Nguyen, A. Mattick, OpenAssistant conversations - democratizing large language model alignment, 2023. arXiv:2304.07327.

[25] Yang, Fang, Lin, Anserini: Enabling the use of Lucene for information retrieval research, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 1253–1256. URL: https://doi.org/10.1145/3077136.3080721. doi:10.1145/3077136.3080721.

[26] Gao, Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6894–6910. URL: https://aclanthology.org/2021.emnlp-main.552. doi:10.18653/v1/2021.emnlp-main.552.

[27] Yang, Fang, Lin, Anserini: enabling the use of Lucene for information retrieval research, in: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, 2017, pp. 1253–1256.

[28] Lin, Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: An easy-to-use python toolkit to support replicable ir research with sparse and dense representations, arXiv preprint arXiv:2102.10073 (2021).

[29] Robertson, Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.

[30] Luo, Sun, Xia, Qin, Zhang, H. Poon, T.-Y. Liu, BioGPT: generative pre-trained transformer for biomedical text generation and mining, Briefings in Bioinformatics 23 (2022). URL: https://doi.org/10.1093/bib/bbac409. doi:10.1093/bib/bbac409.

[31] Zhang, Zhao, Saleh, P. J. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, in: Proceedings of the 37th International Conference on Machine Learning, ICML'20, JMLR.org, 2020.

[32] G. V. Cormack, C. L. Clarke, S. Buettcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009, pp. 758–759.

Pythia-12B: No, according to this study which analyzed over 100 women's breast milk.

Question: Which amino acid in implicated in the Blue diaper syndrome?

OA-LLaMA-30B: The Blue diaper syndrome is caused by a frameshift mutation in the Proprotein Convertase Subtilisin/Kexin Type 1 (PCSK1) gene which results in the substitution of Valine for Leucine at position 227 (p.[Val227Leufs*12])