<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BIT.UA at BioASQ 11B: Two-Stage IR with Synthetic Training and Zero-Shot Answer Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tiago Almeida</string-name>
          <email>tiagomeloalmeida@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard A. A. Jonker</string-name>
          <email>richard.jonker@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roshan Poudel</string-name>
          <email>proshan@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge M. Silva</string-name>
          <email>jorge.miguel.ferreira.silva@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sérgio Matos</string-name>
          <email>aleixomatos@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IEETA/DETI, LASI, University of Aveiro</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the efforts of the Biomedical Informatics and Technologies (BIT) group at the University of Aveiro in the eleventh edition of the BioASQ challenge. We addressed Task B in its two phases: document retrieval (phase A) and question answering (phase B). In phase A, we utilized a sparse retrieval method for initial document retrieval, implemented using Anserini, followed by a re-ranking step using transformer models, including monoT5 and PubMedBERT. Phase B featured the application of large language models (LLMs) to generate answers to questions based on a relevant article, with models such as Alpaca-LoRA, OA-Pythia, and OA-LLaMA. We also explored a variety of prompts and question types, as well as different generation strategies to optimize our system's performance. In phase A, our systems achieved competitive results, scoring at or close to the top in all batches and achieving the best F1 results in every batch. In phase B, our systems underperformed according to the automatic measures. Code to reproduce our submissions is available at https://github.com/ieeta-pt/BioASQ_11B.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Dense Retrieval</kwd>
        <kwd>Language model</kwd>
        <kwd>Answer Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The realm of biomedical literature has been experiencing an exponential increase, predominantly
driven by the rise in open-access and peer-reviewed publications. This rapid expansion results
in an information overload, posing a significant challenge to researchers, physicians, and other
healthcare practitioners [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As delineated by Klerings et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the primary concern stems
not from the abundance of information but the scarcity of sophisticated information retrieval
systems proficient in managing this growing body of literature. To mitigate this, the BioASQ
challenge is a yearly competition that stimulates the creation of intelligent retrieval systems. In
its eleventh year, the BioASQ challenge [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] comprises several tasks targeting unique facets of
information retrieval and text mining within the biomedical domain.
      </p>
      <p>Task B and the Synergy task emphasise information retrieval and question-answering. Task
B bifurcates into phases A and B. Phase A involves identifying relevant documents or snippets
that answer a biomedical question, while phase B addresses the extraction and generation of
responses. These tasks collectively aim at advancing systems that provide evidence or answers
to open-ended biomedical queries. In contrast, the Synergy task seeks to resolve open-ended
questions about COVID-19 by leveraging IR and QA systems.</p>
      <p>
        This paper describes our participation in phase A and in the ideal-answer part of phase B of Task B of the
BioASQ challenge. During phase A, we utilized the traditional BM25 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for base document
retrieval, followed by document re-ranking executed via a variety of transformer models,
including monoT5 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and PubMedBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These models were fine-tuned on prior years’
data, and synthetic data generation was employed to mitigate the constraints of a small dataset
size. During phase B, we adopted a naive unsupervised approach where language models were
prompted to generate answers to a question provided with an article as context. The approach
involved exploring various models and prompts along with differing context selections. Figure 1
shows an illustration of an end-to-end pipeline for an information retrieval and answering system.
      </p>
      <p>Following this introduction, Section 2 reviews the related work. Section 3 is the
methodological section, where we describe the datasets and corpora used and detail
the employed methodologies. Section 4 shows our results and Section 5 discusses them. The
paper concludes in Section 6, summarising the key findings of our participation, with a brief
discussion of future work in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The BioASQ challenge has consistently catalyzed significant advancements in biomedical
information retrieval and question-answering. Task B, in particular, encapsulates the essence of
these complex processes, focusing on two fundamental fields: Information Retrieval (IR) and
Question Answering (QA).</p>
      <p>
        Fundamentally, IR (phase A) aims to identify and retrieve relevant documents or snippets
that align with a posed biomedical question, thereby addressing the issue of locating pertinent
information within the vast corpus of biomedical literature [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. QA (phase B), on the other
hand, is concerned with extracting and generating comprehensive answers from the retrieved
information. This intricate process requires understanding the question at hand and determining
the most suitable answer by leveraging the context provided by the retrieved documents.
      </p>
      <p>
        In the latest competition, the state-of-the-art performances were achieved by systems that
utilized a two-step process: an initial sparse retrieval system followed by a Transformer-based
re-ranking model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This approach was not unique to a single submission but was rather a
common thread among various entries. Our previous work Almeida et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] also employed a
similar pipeline that used BM25 as the first stage and, in the second stage, powerful
models such as PubMedBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and UPWM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These models have shown remarkable
proficiency in interpreting intricate biomedical queries and matching them to relevant articles.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Information Retrieval</title>
        <p>
          Information Retrieval (IR) involves identifying relevant documents that match a specific query.
IR can be broadly categorized into sparse retrieval and dense retrieval. Sparse retrieval, usually
associated with more traditional approaches, involves converting text into an inverted index
to enable fast searching. An inverted index stores a mapping of terms to documents. Sparse
retrieval has the advantage that it is fast and explainable. Simpler sparse retrieval
approaches include Bag-of-Words and term frequency-inverse document frequency (tf-idf). There
are also sparse retrieval techniques that are enhanced by transformer-based models such as
DeepCT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and HDCT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] which produce contextualized term weights that can be stored in
traditional inverted indexes. Nevertheless, one of the most relevant and well-known algorithms
used in sparse retrieval is BM25 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
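        <p>
          <disp-formula><tex-math><![CDATA[\mathrm{BM25}(q, d) = \sum_{t \in q} \left( \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)} \cdot \ln\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right) \right).]]></tex-math></disp-formula>
        </p>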
        <p>Here, tf(t, d) represents the frequency of term t in document d, |d| represents
the length of the document, avgdl is the average length of a document in the collection, N is
the number of documents in the collection and df(t) is the number of documents containing
term t. The sum runs over the terms of the query, and k<sub>1</sub> and b are hyperparameters that can be tuned.</p>
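        <p>As a minimal sketch of the scoring function above, the following pure-Python snippet scores tokenized documents against a tokenized query. The toy corpus, the function name <monospace>bm25_score</monospace> and the whitespace-free token lists are illustrative assumptions; the default values of k<sub>1</sub> and b mirror those reported later in Section 3.2. It is not the Anserini implementation used in this work.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=0.5, b=0.3):
    """Score one tokenized document against a tokenized query with BM25."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        # df(t): number of documents that contain the term
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        # This idf form (no +1 inside the log) can be negative for very common terms
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Toy corpus (hypothetical data, for illustration only)
corpus = [
    ["brca1", "mutation", "breast", "cancer"],
    ["insulin", "resistance", "diabetes"],
    ["aspirin", "stroke", "prevention"],
    ["statin", "cholesterol", "therapy"],
]
query = ["brca1", "cancer"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
```
]]></preformat>
        <p>In this toy example only the first document contains query terms, so it is the only one with a non-zero score. A production system would precompute document frequencies in an inverted index rather than scanning the corpus per term.</p>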
        <p>
          On the other hand, a more recent approach called dense retrieval has emerged, utilizing
transformer models to convert both documents and queries into the same dimensional space [
          <xref ref-type="bibr" rid="ref12 ref33">12</xref>
          ].
In this approach, the query is transformed into a vector representation by the dense retrieval
model. The search process involves comparing the similarity of the query vector against all the
document vectors that have been previously encoded. Prominent approaches in this domain
include DPR [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and ANCE [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which employ transformer-based models to learn a joint
dimensional space for projecting queries and documents in a meaningful way. This enables
queries to be closer in dimensional space to their relevant documents. To facilitate efficient
execution of this type of search, libraries like Facebook’s FAISS [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] offer a comprehensive
framework designed specifically for this purpose.
        </p>
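        <p>The search step of dense retrieval can be sketched as a nearest-neighbour lookup over precomputed vectors. The snippet below is a brute-force illustration under stated assumptions: the three-dimensional vectors are made up, cosine similarity is one possible similarity measure, and the helper names are hypothetical. Real systems obtain the vectors from a trained encoder and use an index such as FAISS instead of scanning every document.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_search(query_vec, doc_vecs, top_k=2):
    """Rank documents by similarity to the query vector (brute force)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical, pre-encoded document vectors
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.3],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]
results = dense_search(query_vec, doc_vecs)
```
]]></preformat>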
        <p>
          Both the dense retrieval and sparse retrieval techniques can be broadly classified as
representation-based approaches. In this approach, the document and query are encoded
separately, and the search is performed based on either similarity measures (dense retrieval)
or cumulative scores (sparse retrieval). In contrast, interaction-based approaches jointly score
the query and document, allowing for the extraction of more intricate matching patterns and
potentially improving retrieval results. However, due to the need to score the query against
every document in the collection, interaction-based approaches are not practical for searching
the entire document corpus. Therefore, representation-based approaches are commonly adopted
as first-stage retrieval techniques to reduce the search space. Subsequently, more powerful
interaction-based techniques can be employed to further refine the ranking order, a process
known as re-ranking in the literature. These models are typically trained using pointwise and
pairwise techniques [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Pointwise learning involves assigning a score to each document, and
the ranking is then performed by sorting these scores. On the other hand, pairwise learning
involves comparing pairs of documents and enforcing a margin between positive and negative
document pairs, leading to a more discriminative learning process.
        </p>
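        <p>To make the distinction concrete, the sketch below contrasts a pointwise and a pairwise objective. Binary cross-entropy and a hinge loss with margin 1.0 are common choices and are assumptions here; the text does not state which exact loss functions were used in this work.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def pointwise_loss(score, label):
    """Binary cross-entropy on a single (query, document) score.

    `label` is 1 for a relevant document and 0 otherwise.
    """
    prob = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that is zero once the positive document outscores
    the negative one by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))


well_separated = pairwise_hinge_loss(2.0, 0.5)   # margin satisfied
poorly_separated = pairwise_hinge_loss(0.2, 0.1)  # margin violated
```
]]></preformat>
        <p>The pointwise loss looks at one document at a time, whereas the pairwise loss only depends on the score difference within a positive/negative pair.</p>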
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Question Answering</title>
        <p>Question Answering (QA) aims to provide accurate and relevant answers to various questions.
QA tasks can be generally divided into two main categories:
          <list list-type="bullet">
            <list-item><p>Extractive QA involves identifying and extracting an answer from the given context.</p></list-item>
            <list-item><p>Generative QA requires the model to generate an answer freely, sometimes requiring a
context.</p></list-item>
          </list>
        </p>
        <p>Generative QA can further be divided into open and closed generative QA. In open generative
QA, the answer is generated using a provided context. This is not to be confused with open-domain
QA. In closed generative QA there is no context, and the model generates the answer entirely on its own.</p>
        <p>
          More recent generative QA approaches leverage large language models (LLMs) for zero-shot
answer generation. In this setup, the model is provided with a query containing the context and
asked to generate an answer. This approach is relatively new in the literature. GPT-3 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] is a
powerful autoregressive language model that uses deep learning to produce human-like text. It
has 175 billion parameters and has been applied successfully in zero-shot tasks that require a
deep understanding of context, making it a suitable choice for generative QA tasks.
        </p>
        <p>
          Other recent LLMs have surfaced, such as LLaMA [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], Alpaca [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and Pythia [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. LLaMA
is a foundation LLM that is based on various transformer-based architectures, namely GPT-3
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], PaLM [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], and GPTNeo [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Alpaca is an LLM based on LLaMA that was fine-tuned on
the text generated by OpenAI’s GPT-3.5. Using this technique of knowledge distillation, LLMs
can be made much smaller without sacrificing too much performance. Alpaca-LoRA<sup>1</sup> employs
an approach known as Low-Rank Adaptation [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], which keeps the pre-trained model weights
constant and introduces trainable rank decomposition matrices at each layer of the Transformer
architecture. This significantly reduces the number of trainable parameters for downstream
tasks. Pythia is a suite of pre-trained Transformer language models, which are
also GPT-based. OpenAssistant [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] fine-tuned Pythia and LLaMA models on human-labelled
datasets to boost the models’ performance and create an open-source competitor to ChatGPT.
        </p>
        <sec id="sec-2-2-1">
          <title><sup>1</sup> https://github.com/tloen/alpaca-lora</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The methodology section commences with a comprehensive overview of the corpora and the
dataset used in each task. Subsequently, it details the methods employed for each task we
participated in.</p>
      <sec id="sec-3-1">
        <title>3.1. Corpora and Dataset</title>
        <p>For Task B, we were provided with a dataset containing data from the first ten editions of
the challenge. The dataset included 4719 questions, categorized as 1417 ‘factoid’, 1271 ‘yesno’,
1130 ‘summaries’, and 901 ‘lists’. Each question was accompanied by its relevant documents,
snippets, concepts, RDF triples, and exact and ideal answers. To construct our corpus, we
utilized the PubMed annual baseline document collections spanning from 2013 to 2023. This
corpus consisted of the abstracts and titles of all documents. The most recent PubMed baseline
collection (2023) contains approximately 35 million documents. However, we encountered a
challenge due to the dynamic nature of the documents. Each year, documents are updated or
removed, which means that the relevant documents for a question in the first edition may no
longer be present in the document collection for the current edition. This posed a problem
when relying solely on the latest baseline collection to extract the title and abstract for accurate
querying. To address this issue, we augmented each question with the year it appeared in,
enabling us to query the relevant documents more precisely.</p>
        <p>
          Additionally, we encountered some documents that were missing titles, abstracts, or both.
This could be due to licensing or linguistic issues. We addressed this by removing these
incomplete documents from the collection. Afterwards, we created sparse Anserini [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] indexes
for each year. Having yearly indexes proved advantageous as it allowed us to search for relevant
documents specific to the year in which a question appeared. This approach enhanced the
accuracy of retrieving pertinent information for each question.
        </p>
        <p>
          Regarding the question dataset, there were cases where questions were repeated or were very
similar while having a different set of relevant documents. We therefore decided to
merge similar questions, combining their sets of relevant articles to enrich the training data. To
accomplish this, we leveraged the pre-trained SimCSE [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] model to compute the similarity
between questions. Questions with a similarity score above 99% were automatically
merged, while questions with a similarity score between 90% and 99% were manually reviewed.
An additional step was to remove the questions from before BioASQ 4. During those years of
the challenge, the systems were able to use the full-text articles from PubMed Central (PMC)
to make judgments. Since our corpus contains only titles and abstracts, the model may not have the necessary
content to make a correct prediction for these question-document pairs. At the end of this process,
3465 questions remained (-30%), totalling 25781 positive question-document
pairs. To build a training dataset for neural relevance models, we also need to gather
negative question-document pairs, such that the model can learn how to correctly score relevant
and irrelevant documents. To accomplish this, we performed random sampling over the list
of documents returned by BM25 that were not positives for a given question. This should
result in a list of strong negative documents.
        </p>
        <p><bold>3.1.1. Synthetic Question Generation</bold></p>
        <p>
          Data quality and quantity are crucial for developing strong, effective models in deep learning
for information retrieval and relevance determination. In the previous section, we described
our pre-processing steps to increase the quality of the gold standard data. However, we
still lack data quantity. We propose generating questions with transformer-based
language models to create a synthetic dataset that can be used to pre-train the relevance models
so that they first learn basic retrieval patterns.
        </p>
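        <p>Returning briefly to the negative sampling used to build the gold-standard training pairs in Section 3.1, the procedure can be sketched as below. The function name, the fixed seed and the toy document identifiers are hypothetical; only the idea of sampling non-positive documents from the BM25 ranking comes from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def sample_negatives(bm25_ranking, positive_ids, n_negatives, seed=0):
    """Randomly pick documents from a BM25 ranking that are not known positives."""
    rng = random.Random(seed)
    candidates = [doc_id for doc_id in bm25_ranking if doc_id not in positive_ids]
    return rng.sample(candidates, min(n_negatives, len(candidates)))


# Hypothetical BM25 output and gold-standard positives for one question
ranking = ["d1", "d2", "d3", "d4", "d5"]
positives = {"d2", "d4"}
negatives = sample_negatives(ranking, positives, n_negatives=2)
```
]]></preformat>
        <p>Because the candidates were retrieved by BM25, they share vocabulary with the question, which is what makes them harder negatives than documents drawn uniformly from the whole collection.</p>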
        <p>To synthetically generate a question for a given article, we fed the language model an engineered prompt that
tries to condition it to generate a question based on the information contained
in the article. More formally, we empirically built the prompt, p = {p<sub>1</sub>, ..., p<sub>n</sub>}, such that a
language model would maximize the probability of a question, q = {q<sub>1</sub>, ..., q<sub>m</sub>}, being sampled
according to Equation 1.</p>
        <p>
          <disp-formula><label>(1)</label><tex-math><![CDATA[q \sim \prod_{i=1}^{m} P(q_i \mid p_1, \ldots, p_n, q_1, \ldots, q_{i-1})]]></tex-math></disp-formula>
        </p>
        <p>In this work, we mainly used zero-shot question generation since we did not explore training
the language models to generate questions based on the BioASQ data. To further guide the
language model into generating useful questions, we also included a question starting word as
part of the prompt, such that the model is forced to pick the following words conditioned
on that starting word. Some examples of words that start a question are {What, Which, Is,
List, Are, Does}<sup>2</sup>. Prompt 1 shows the prompt that we adopted for generating a question in a
zero-shot fashion with the OA-Pythia 12B model.</p>
        <preformat>&lt;|prompter|&gt;Given the following context
\"{article}\", generate a question that can be
answered by the information provided in the
context: &lt;|endoftext|&gt;&lt;|assistant|&gt;What</preformat>
        <p>Prompt 1: Example of the final prompt used to generate synthetic questions with the OA-Pythia
model.</p>
        <p>
          Regarding the language models, we experimented with small models such as
GPT-Neo-125M [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and with larger ones such as the OA-Pythia-12B [
          <xref ref-type="bibr" rid="ref20 ref24">20, 24</xref>
          ] model. The
synthetic dataset contained 79855 questions that were generated from 15971 randomly sampled
articles.
        </p>
        <p><sup>2</sup> These words follow the distribution of the starting words that appear in the BioASQ dataset.</p>
        <p><bold>3.2. Phase A</bold></p>
        <p>
          Our approach for phase A of the challenge involved the development of a two-stage retrieval
pipeline designed to handle the large volume of biomedical literature efficiently. Figure 1
presents the overview of our two-stage retrieval pipeline.
        </p>
        <p>
          At first, we utilized a sparse retrieval method. To accomplish this, we constructed an inverted
index, a commonly used data structure in information retrieval that maps terms to the
documents that contain them, using Anserini, a powerful retrieval toolkit built on Lucene [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. For
compatibility with our Python-based pipeline, we used Pyserini, Anserini’s Python wrapper
[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ].
        </p>
        <p>
          For document retrieval, we adopted the BM25 ranking function, which is widely recognized
for its effectiveness [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. We selected the top 100 documents based on BM25 scores as the initial
retrieval result and occasionally extended them to the top 1000 for broader coverage. Figure
3 illustrates that extending from the top 100 to the top 1000 documents increases recall
by 20 percentage points (from 71% to 91%). This extension provides a
higher chance of retrieving more relevant documents. However, it comes with a trade-off in
speed, as the neural retrieval system needs to process ten times more documents. It is worth
noting that if the top 100 documents already contain a sufficient number of positive documents,
using the top 1000 may not yield significant gains in metrics. This observation is
addressed later in the discussion section.
        </p>
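        <p>The recall figures quoted above correspond to the fraction of gold-standard documents that appear within the first k results. A small sketch of that computation follows; the document identifiers are made up and the function name is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found among the top-k ranked documents."""
    if not relevant_ids:
        return 0.0
    retrieved = set(ranked_ids[:k])
    return len(retrieved & set(relevant_ids)) / len(relevant_ids)


# Hypothetical ranking and relevance judgments for one question
ranking = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d1", "d8"}
recall_top2 = recall_at_k(ranking, relevant, k=2)
recall_top5 = recall_at_k(ranking, relevant, k=5)
```
]]></preformat>
        <p>Recall can only grow as k increases, which is why a deeper first-stage cut-off trades re-ranking cost for coverage.</p>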
        <p>[Figure 3: Recall (y-axis, 0.5 to 0.9) as a function of the size of the retrieved document set (x-axis, 0 to 5000).]</p>
        <p>The parameters for BM25, specifically b and k<sub>1</sub>, were selected through a preliminary
hyperparameter tuning process. Figure 4 shows a summary of all the runs and their respective
parameters. Based on this, we adopted the parameters k<sub>1</sub> = 0.5 and b = 0.3.</p>
        <p>
          In the second stage, we utilized re-ranking models, which include state-of-the-art
transformer-based models such as PubMedBERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and monoT5 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (both base and large
variants). We also considered the BioGPT [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and Pegasus [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] models, but due to their higher
computational cost, they were discarded. The re-ranking models were trained using both pointwise and
pairwise approaches to evaluate their effectiveness in differing scenarios. To compensate for the
limited availability of training data, we also experimented with including synthetic data in our
training regimen as a pretraining mechanism.
        </p>
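        <p>Since several re-ranking models are trained, their individual rankings have to be combined into a single list. Reciprocal rank fusion (RRF) [32], the method adopted in this pipeline, can be sketched as follows. The constant k = 60 is the value commonly used with RRF and is an assumption here, as are the toy rankings; the value used in the submitted systems is not stated in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document receives 1 / (k + rank)
    from every list it appears in, and documents are sorted by the sum."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of three re-ranking models for the same question
model_rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused_ranking = reciprocal_rank_fusion(model_rankings)
```
]]></preformat>
        <p>RRF only needs rank positions, so models whose raw scores live on different scales (for example PubMedBERT and monoT5) can be ensembled without score calibration.</p>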
        <p>
          Finally, to consolidate the output from several models, we used reciprocal rank fusion
(RRF) [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. This approach acts as an ensemble technique to improve the overall ranking order
of the relevant documents by considering the judgment of multiple models.
        </p>
        <p><bold>3.2.1. Submissions</bold></p>
        <p>The runs submitted to phase A of the challenge were ensembles of various trained models with
different checkpoints. The submitted systems are briefly described in Table 1.
          <list list-type="bullet">
            <list-item><p>System-0: This system contained 3 PubMedBERT models that re-ranked 1000 documents
fetched from BM25.</p></list-item>
            <list-item><p>System-1: This system contained 5 PubMedBERT models that re-ranked 100 documents
fetched from BM25.</p></list-item>
            <list-item><p>System-2: For the first batch, the system contained 2 T5-base and 2 T5-Large models
that re-ranked 100 documents from BM25. For the rest of the batches, the system used
an ensemble of models trained on synthetic data and then fine-tuned on the challenge
data. The models ensembled were 7 PubMedBERT models, 5 of which re-ranked 1000
documents, and the remaining 2 re-ranked 200 documents.</p></list-item>
            <list-item><p>System-3: In the first batch, an ensemble of 2 T5-base models and 2 PubMedBERT models
was used to re-rank 100 documents. The following 2 batches investigated pairwise
training with synthetic data, where 7 PubMedBERT models were trained using a pairwise
loss function. Among them, 4 models re-ranked 100 documents, and 3 re-ranked 500
documents. In the final batch, some models were removed and replaced with models from
the first system.</p></list-item>
            <list-item><p>System-4: In the first batch, the system contained 2 T5-Large models and 2 T5-Base
models. The Large models re-ranked 1000 documents, and the Base models re-ranked
100 documents. In the remaining submissions, we ensembled most of our trained models,
reaching a total of 25 models. However, it should be noted that only 24 models were used
in the last batch.</p></list-item>
          </list>
        </p>
        <p><bold>3.3. Phase B</bold></p>
        <p>To provide a natural language answer to a question, we adopted an exploratory approach,
testing various prompts and models to gauge their effectiveness in generating precise and
meaningful answers. Recognizing the varying complexities inherent to different question types,
we also experimented with per-question type prompting. This approach considers the nature
of the question (factoid, list, or summary) and tailors the model’s prompt accordingly,
enabling more accurate and contextually relevant responses; see Prompt 2 for reference.</p>
        <preformat>Below is an instruction that describes a task,
paired with an input that provides further
context. Write a response that appropriately
completes the request.
### Instruction:
{instruction}
### Input:
ABSTRACT: {text}
QUESTION: {question}
### Response:</preformat>
        <p>Prompt 2: Example of a zero-shot prompt for generating answers with the Alpaca-LoRA model.
The text within brackets corresponds to placeholders for the instruction, question
and article text. The default instruction was “Given the ABSTRACT, answer the
QUESTION”. For the yesno type of question we used the following: “Given the input
ABSTRACT produce a yes or no answer to QUESTION”, while for the summary
type we used “Given the input ABSTRACT produce a short and concise answer to
QUESTION”.</p>
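        <p>The per-question-type prompting described above can be sketched as a template plus a lookup of the instruction by question type. The instruction strings are taken from the caption of Prompt 2; the blank lines in the template, the dictionary keys and the helper name <monospace>build_answer_prompt</monospace> are assumptions for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\nABSTRACT: {text}\nQUESTION: {question}\n\n"
    "### Response:\n"
)

# Question types with a dedicated instruction; all others use the default
INSTRUCTIONS = {
    "yesno": "Given the input ABSTRACT produce a yes or no answer to QUESTION",
    "summary": "Given the input ABSTRACT produce a short and concise answer to QUESTION",
}
DEFAULT_INSTRUCTION = "Given the ABSTRACT, answer the QUESTION"


def build_answer_prompt(question, text, question_type):
    """Select the instruction for the question type and fill the template."""
    instruction = INSTRUCTIONS.get(question_type, DEFAULT_INSTRUCTION)
    return ALPACA_TEMPLATE.format(
        instruction=instruction, text=text, question=question
    )


yesno_prompt = build_answer_prompt(
    "Is BRCA1 associated with breast cancer?",
    "BRCA1 mutations increase breast cancer risk.",
    "yesno",
)
factoid_prompt = build_answer_prompt(
    "Which gene is mutated?", "BRCA1 mutations increase risk.", "factoid"
)
```
]]></preformat>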
        <p>Context selection should play a large part in the quality of the text generation. We tested this
using our top retrieved article from phase A, the top gold standard article, or a combination
of both as context for the model. The latter was accomplished by selecting the gold standard
article that was ranked highest according to our model. A key focus of our experimentation was
the application of various language models, namely Alpaca-LoRA (13 billion parameters),
OA-Pythia (12 billion) and OA-LLaMA (30 billion). Furthermore, we experimented with different
answer-generation strategies, including random sampling, beam search and contrastive search.
In random sampling, the next token is drawn at random according to the probability
distribution of the model. In beam search, multiple possible continuations are
explored at each step based on a predefined beam width, aiming to find the most probable
sequence of words. Contrastive search selects, among the most probable candidate tokens,
the one that balances model confidence against a penalty for similarity to the preceding
context, which discourages repetitive text. We also extensively tested different hyperparameters for model generation, including
temperature and the maximum token length. This experimentation allowed us to tune our
models’ performance, leading to more precise and informative answers.</p>
        <p><bold>3.3.1. Submissions</bold></p>
        <p>For Phase B, our submissions consisted of various instruction-tuned transformer-based models, each
described concisely in Table 2. The “Document Source” column specifies the origin of the article
used as context for answer generation. Specifically, “System-0” and “System-4” correspond to
the highest scoring documents output by the respective phase A system. On the other hand,
“Gold” indicates that the document was obtained from the provided gold standard.</p>
        <p>More precisely, due to time constraints, for the first batch we selected only a single model
for submission, using the top-ranked document from our best-performing model in phase A.
This was seen as a naive approach, which is why in further batches we tested with both our
models and the gold standard document to answer a question. In the second batch we used
the same Alpaca-LoRA model, with the addition of the 30 billion parameter version, tested on
the best-performing model of the batch and also using the gold standard documents. For the
third batch, we were unable to submit five submissions due to technical problems; in this
batch we changed the model to OpenAssistant’s Pythia 12 billion parameter model. In the
final batch, we additionally tested the OpenAssistant LLaMA model with 30 billion parameters.
Regarding the generation strategies, we adopted contrastive search for Alpaca-LoRA and
random sampling with high confidence for the OpenAssistant variants.</p>
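        <p>To illustrate how random sampling and the temperature hyperparameter interact, the sketch below samples a token index from a list of logits. The logits, temperatures and helper names are made up for illustration, and this is a generic description of temperature sampling rather than the exact decoding configuration of the submitted systems.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Random sampling: draw the next token index from the model distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


def greedy_token(logits):
    """Greedy decoding: always pick the most probable token index."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
sharp = softmax(logits, temperature=0.1)
flat = softmax(logits, temperature=1.0)
token = sample_token(logits, temperature=0.7, rng=random.Random(0))
```
]]></preformat>
        <p>With a low temperature almost all probability mass falls on the top token, so sampling behaves nearly like greedy decoding; higher temperatures yield more varied outputs.</p>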
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section starts by addressing our validation results measured over a subset of the training
data. Then we show the official preliminary results of the BioASQ challenge for phases A and
B. Note that the preliminary results are those available at the time of writing and are subject to
change after the re-evaluation period. For the official results, see the BioASQ 11B official
leaderboard<sup>3</sup>.</p>
      <p><sup>3</sup> Phase A: http://participants-area.bioasq.org/results/11b/phaseA/, Phase B: http://participants-area.bioasq.org/results/11b/phaseB/</p>
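      <p>[Table 3: Validation results (MAP@10). Listed model types: PubMedBERT (three configurations), monoT5-base (two configurations) and monoT5-large, each applied on top of BM25 retrieval, plus the BM25 baseline.]</p>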
      <sec id="sec-4-1">
        <title>4.1. Validation results</title>
        <p>The validation of our models was conducted to assess their performance and gain valuable
insights into their configuration. In this section, we summarize the validation results obtained
over a subset of the training data. More precisely, we performed a stratified train/test split of
95/5 of the dataset, which corresponds to 3292 questions for training and 173 for validation.
Given that an official evaluation batch contains only 90 questions, we
believe that our split was representative.</p>
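        <p>A stratified 95/5 split of this kind can be sketched as follows. We assume here, for illustration only, that the stratification key is the question type; the data, the seed and the function name are hypothetical.</p>
        <preformat>
```python
import random
from collections import defaultdict

def stratified_split(questions, key, validation_fraction=0.05, seed=42):
    """Split items so each stratum keeps roughly the same train/validation ratio."""
    strata = defaultdict(list)
    for question in questions:
        strata[key(question)].append(question)
    rng = random.Random(seed)
    train, validation = [], []
    for group in strata.values():
        rng.shuffle(group)
        cut = round(len(group) * validation_fraction)
        validation.extend(group[:cut])
        train.extend(group[cut:])
    return train, validation

# Hypothetical data: 3465 questions spread evenly over four question types.
types = ["yesno", "factoid", "list", "summary"]
questions = [{"id": i, "type": types[i % 4]} for i in range(3465)]
train, validation = stratified_split(questions, key=lambda q: q["type"])
```
        </preformat>
        <p>Because the cut is rounded within each stratum, the resulting sizes only approximate the overall 95/5 proportion.</p>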
        <p>Table 3 summarizes the best validation results of various neural relevance models trained
on different subsets of data, together with the BM25 baseline. The models were evaluated with
Mean Average Precision at 10 (MAP@10), that is, the average precision computed over
the top 10 retrieved documents of each query and then averaged across queries. Each neural model was trained using different
combinations of training data, including synthetic and gold standard datasets.</p>
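        <p>For reference, one common way of computing MAP@10 is sketched below. The normalisation by the smaller of the number of relevant documents and the cutoff is an assumption of this sketch; the official BioASQ evaluation may normalise differently.</p>
        <preformat>
```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k results of a single query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalisation: best achievable number of hits within the cutoff.
    return precision_sum / min(len(relevant), k)

def map_at_k(runs, k=10):
    """Mean of the per-query average precision; runs holds (ranking, relevant) pairs."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in runs) / len(runs)

# Hypothetical rankings for two questions.
runs = [
    (["d1", "d2", "d3"], ["d1", "d3"]),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], ["d5"]),              # AP = (1/2) / 1
]
score = map_at_k(runs)
```
        </preformat>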
        <p>Overall, when trained with the gold standard data, all the reranking methods
improve upon the baseline, reinforcing the idea that it is beneficial to adopt a reranking
method as the second stage of a retrieval pipeline. Regarding the architectures,
PubMedBERT and monoT5-large achieved comparable performance,
whilst monoT5-base achieved considerably poorer results. This disparity may
be attributed to the fact that monoT5 is a sequence-to-sequence model that directly learns the
retrieval task using natural language, placing greater emphasis on the quality of the underlying
language model and, therefore, on its size.</p>
        <p>Notably, the best configuration we obtained involved training the PubMedBERT model with
synthetically generated data and subsequently fine-tuning it with the gold data. This outcome
highlights the beneficial impact of incorporating synthetic data.</p>
        <p>Furthermore, an unexpected result emerged when comparing models
trained only with synthetic data against the BM25 baseline: using
only synthetic data yielded improvements over BM25.
This suggests that it is possible to train models without relying on gold data and still
outperform traditional baselines such as BM25. This finding
highlights the potential of synthetic data as a
valuable resource for retrieval tasks where no labelled data is available.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Phase A</title>
        <p>The preliminary results of our submissions are displayed in Table 4, ranked
by Mean Average Precision at 10 (MAP@10). Additionally, we provide the
F1-score at 10, offering insight into the trade-off between precision and recall across
the systems. The Top Competitor represents the most successful system among all competitor
systems. Overall, we achieved highly competitive results, with the best-performing system
in the first and second batches in MAP@10 and the best-performing system in all the batches
in F1-score. Notably, the systems that attained these high F1-scores were relevance
models designed to discard documents whose estimated likelihood of relevance fell below 1%.
Consequently, for questions with fewer than 10 positive documents, these systems could
output fewer than 10 documents, thus increasing precision compared to a ranking model
that always outputs 10 documents regardless of their scores.</p>
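        <p>The effect of the 1% relevance threshold on F1 can be illustrated with a small sketch. The scores and document identifiers below are hypothetical, and the function names are ours for illustration.</p>
        <preformat>
```python
def select_documents(scored_docs, threshold=0.01, max_docs=10):
    """Keep at most max_docs documents whose relevance probability reaches the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, prob in ranked[:max_docs] if prob >= threshold]

def f1_score(returned, relevant):
    """Set-based F1 between returned and relevant document ids."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned.intersection(relevant))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(returned)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)

# Hypothetical scores: two confident hits and eight near-zero candidates.
scored = [("d1", 0.97), ("d2", 0.88)] + [(f"n{i}", 0.001) for i in range(8)]
relevant = ["d1", "d2"]
always_ten = [doc_id for doc_id, _ in scored]
filtered = select_documents(scored)
```
        </preformat>
        <p>In this toy case both outputs have full recall, but only the filtered list avoids the eight low-scoring documents that reduce precision.</p>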
        <p>Comparing the systems with each other, the first two, System-0 and
System-1, were used to study the difference between re-ranking 1,000 and 100 documents.
Their results across the batches show that performance was
not significantly affected by the increase in re-ranked documents. This observation is
revisited in the discussion section.</p>
        <p>For the remaining systems in the first batch, the T5 models
did not significantly improve performance compared to the BERT models. This
is important given that the inference time of the T5
models exceeded that of the BERT-based models. Consequently, we stopped
using T5 models in subsequent submissions, favouring instead the more efficient
BERT models, which delivered adequate performance. Furthermore, System-4 in the initial
batch showed an unexpectedly lower Mean Average Precision (MAP) than
the other ensemble methods, indicating that the specific configuration or ensemble
of models in System-4 did not yield the anticipated results.</p>
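        <p>As an illustration of how rankings from several rerankers can be combined, the sketch below implements reciprocal rank fusion, the method described in reference [33]. Whether every ensemble submission used exactly this fusion is not stated in this section, so the code should be read as a generic example; the run names and document identifiers are hypothetical.</p>
        <preformat>
```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several rankings: each document scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document id.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]

# Hypothetical outputs of three rerankers for one question.
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d1", "d4"]
run_c = ["d2", "d3", "d1"]
fused = reciprocal_rank_fusion([run_a, run_b, run_c])
```
        </preformat>
        <p>The constant k = 60 follows the value suggested in [33]; documents ranked highly by several models accumulate the largest fused score.</p>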
        <p>Furthermore, System-2 and System-3 differ only in the
training technique used. Pairwise training slightly
underperformed pointwise training. As a consequence, only pointwise training
was used in the final batch.</p>
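        <p>The difference between the two training regimes can be made concrete with a short sketch. It assumes the usual formulations, binary cross-entropy for pointwise training and a logistic loss on the score margin for pairwise training; the exact losses used in our systems are not specified in this section, and the scores below are hypothetical.</p>
        <preformat>
```python
import math

def sigmoid(x):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def pointwise_loss(score, label):
    """Binary cross-entropy on one (question, document) pair; label is 0 or 1."""
    p = sigmoid(score)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def pairwise_loss(positive_score, negative_score):
    """Logistic loss on the margin between a relevant and a non-relevant document."""
    return math.log(1.0 + math.exp(-(positive_score - negative_score)))

# Hypothetical raw scores from a reranker.
well_ordered = pairwise_loss(2.0, -1.0)
mis_ordered = pairwise_loss(-1.0, 2.0)
confident_hit = pointwise_loss(3.0, 1)
confident_miss = pointwise_loss(3.0, 0)
```
        </preformat>
        <p>The pointwise loss judges each document in isolation and therefore yields calibrated relevance probabilities, whereas the pairwise loss only constrains the relative order of two documents.</p>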
        <p>Turning to the final system, System-4, ensembling more models
consistently outperformed the other systems. Again, this is an anticipated result
that corroborates existing literature [33].</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Phase B</title>
        <p>The preliminary, automatically computed results for phase B are displayed in Table
5. Before analysing the results, it is important to note that the competition evaluates
systems through a manual assessment of the ideal answers, rather than through these
automatic metrics. Overall, our systems showed a reasonable performance on the automatic
metrics, at best placing 9th, with the remaining submissions mostly below the median
position. The automatic metrics used in the competition are ROUGE-2 and ROUGE-SU4; in
the results presented, we show ROUGE-2 (F1). Given that our approach to generating text is
unsupervised, this is not surprising, as our system is not guided to generate
answers in the expected BioASQ style. Nevertheless, an answer can be correct and still be penalized by
the automatic metrics. In Appendix A we showcase some examples of answers
generated by the OA-LLaMA-30B model and the OA-Pythia-12B model.</p>
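        <p>To show why a correct but differently worded answer can score poorly, the following sketch computes a simplified ROUGE-2 F1 as clipped bigram overlap. It omits the stemming and other preprocessing of the official ROUGE toolkit, so its values are not comparable with the official scores; the example sentences are hypothetical.</p>
        <preformat>
```python
from collections import Counter

def bigrams(text):
    """Lower-cased whitespace tokens turned into a multiset of bigrams."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """Simplified ROUGE-2 F1: clipped bigram overlap, no stemming."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference answer and a paraphrase with the same meaning.
reference = "luminopsins are fusion proteins of luciferase and opsin"
paraphrase = "a luminopsin is a luciferase fused to an opsin"
exact = rouge2_f1(reference, reference)
loose = rouge2_f1(paraphrase, reference)
```
        </preformat>
        <p>The paraphrase shares no bigram with the reference, so it receives the lowest possible score despite conveying the same information.</p>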
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>
        Throughout phase A, we observed that our reranking methods consistently enhanced the
baseline ranking order, which is known to be a challenging task, as mentioned in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To provide
a more tangible visualization of these improvements, we present in Figure 6 the ratio of
improvement achieved by our reranking models in comparison to the BM25 baseline. Remarkably,
across all batches, our reranking models achieved an average improvement of 30%, and in some
cases, even nearing 40%. We attribute these notable gains to two primary factors. Firstly, the
quality of our training data played a crucial role, as we focused on meticulous cleaning of the
gold standard data prior to training our models. Additionally, the availability of more advanced
training algorithms enabled efficient fine-tuning of entire transformer-based models, further
contributing to the model’s performance.
      </p>
      <p>[Figure 6: Improvement ratio over the BM25 baseline (MAP@10) for Systems 0 to 4 in Batches 1 to 4.]</p>
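      <p>The improvement ratio reported above can be read as a relative gain over the baseline. The sketch below assumes the definition (system minus baseline) divided by baseline; the MAP@10 values are hypothetical and are not taken from the challenge results.</p>
      <preformat>
```python
def improvement_ratio(system_map, baseline_map):
    """Relative gain of a reranker over the baseline, e.g. 0.30 means +30%."""
    if baseline_map == 0:
        raise ValueError("baseline MAP must be non-zero")
    return (system_map - baseline_map) / baseline_map

# Hypothetical MAP@10 values for one batch.
ratio = improvement_ratio(system_map=0.39, baseline_map=0.30)
```
      </preformat>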
      <p>Next, we delve into a detailed discussion of various factors that impact the performance of
our systems, namely model architecture, loss function, the number of reranked documents,
and the utilization of synthetic data during pretraining. To facilitate this analysis, we present
parallel plots in Figures 7 and 8, showcasing these variables for the models used in the second
and fourth batches, respectively. Although we focus on these two batches for clarity, it is worth
noting that the first and third batches follow similar patterns.</p>
      <p>Both figures indicate that the preferred architecture and loss
function are PubMedBERT and pointwise, respectively, as these models
achieved the highest MAP@10 scores in the plots. Furthermore, increasing the
number of documents used for reranking does not appear to
improve the metrics. This may be because the BioASQ
team evaluates system performance on the top 10 documents only. Therefore, when
there are already enough positive documents among the top 100, reranking a larger number of
documents may not result in noticeable improvements.</p>
      <p>Finally, the impact of synthetic data yields contradictory results. In the case of the second
batch (Figure 7), incorporating synthetic data did not contribute to an overall performance
improvement. However, for the fourth batch, it did exhibit a positive effect. We speculate that
this discrepancy may be attributed to the quantity and coverage of the synthetic questions
generated. Specifically, for the fourth batch, the test set questions may have been closer to
those synthetically generated, particularly in terms of the documents used for their generation.
Further investigation is needed to validate this hypothesis.</p>
      <p>[Figures 7 and 8: Parallel plots of model type (PubMedBERT, monoT5-base, monoT5-large), loss function, number of reranked documents, use of synthetic data, and MAP@10 for the second and fourth batches.]</p>
      <p>The generation phase (Phase B) of our system presented several insightful findings. Notably,
we observed a positive correlation between the size of the language model and the quality of
generated answers, which aligns with previous findings that larger models generally tend to
perform better [34, 35].</p>
      <p>Additionally, we found that small modifications to the prompt significantly impacted the
system’s output, suggesting that the models may struggle with generalization. This effect was
more pronounced in smaller models, indicating that fine-tuning may be necessary to achieve
optimal results [36]. In contrast, for larger models, the quality of generation was less affected
by the prompt variation, showcasing their robustness.</p>
      <p>Overall, the text generation quality was satisfactory, demonstrating coherence and relevance
to the biomedical questions. The employment of different prompts for various question types
particularly enhanced the performance of smaller models, aligning them more closely with the
inherent intricacies of each question category.</p>
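      <p>The idea of selecting a prompt according to the question type can be sketched as follows. The instruction wording, the template layout and the function name are hypothetical illustrations and do not reproduce the prompts used in our submissions; only the four BioASQ question types are taken from the task definition.</p>
      <preformat>
```python
# Hypothetical instructions, one per BioASQ question type.
PROMPTS = {
    "yesno": "Answer the question with yes or no, then justify briefly.",
    "factoid": "Answer the question with a short, precise statement.",
    "list": "Answer the question by listing the relevant items.",
    "summary": "Answer the question with a concise summary.",
}

def build_prompt(question, question_type, context):
    """Assemble an instruction prompt; unknown types fall back to the summary instruction."""
    instruction = PROMPTS.get(question_type, PROMPTS["summary"])
    return (
        f"{instruction}\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "Is SARS-CoV-2 transmitted through breast milk?",
    "yesno",
    "(abstract of the retrieved article)",
)
```
      </preformat>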
      <p>We also explored ensembling multiple contexts to improve answer diversity and depth.
Unfortunately, our attempts were not fruitful, suggesting that this method requires further
refinement to be effective in this specific task.</p>
      <p>Finally, we hypothesize that with some pre-training or domain-specific training, the models
might perform even better. Such training could enhance their ability to generate precise and
contextually accurate answers for biomedical questions, further increasing their utility in
real-world applications [37].</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we detailed our participation in Task B, phases A and B, of the eleventh edition of
the BioASQ challenge. For phase A, we adopted a two-stage retrieval pipeline comprising
Anserini BM25 as the initial stage, followed by reranker models based on the PubMedBERT and
monoT5 transformer models. To train the reranker models effectively,
we enhanced the quality of the gold standard data through careful cleaning and also explored
synthetic data augmentation through question generation. With these methods,
we achieved significant improvements over the baseline ranking order, and our systems
placed first in several batches of the competition.</p>
      <p>For phase B, our approach leveraged instruction-tuned transformer-based models to
generate answers conditioned on the articles retrieved during phase A in a zero-shot setting.
We observed a positive correlation between the size of the language model and the quality of
the generated answers. Smaller models were more sensitive to prompt variations, indicating the
need for fine-tuning to enhance their performance. Larger models, on the other hand, exhibited
greater robustness and generated coherent and relevant answers. The employment of different
prompts for various question types improved the performance of smaller models, aligning them
more closely with the specific intricacies of each question category. Overall, our performance
in phase B, according to the automatic metrics, was mediocre. However, we believe that
further manual analysis is required for a fair evaluation given the unsupervised nature of our
generation method.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>In terms of the direction for future work, several promising avenues appear worthy of
exploration, particularly for Phase B of our system.</p>
      <p>First, while our initial attempts to join multiple contexts (ensembling) did not yield the
anticipated results, we believe this approach still holds considerable potential. Therefore,
refining our ensembling techniques to effectively integrate different contexts into the
question-answering process will be an area of interest. This could potentially enhance both the diversity
and depth of our generated answers.</p>
      <p>Second, the incorporation of snippet extraction as an intermediary step in our approach
might serve to enhance the precision of our answer generation. Extracting relevant snippets
from the retrieved documents could refine the context that is fed into our models, potentially
leading to more accurate and relevant answers. Several recent works have reported success
using such techniques [38].</p>
      <p>Lastly, fine-tuning the models specifically for the ideal answers in Phase B of Task B could
further boost performance. As we observed that prompt changes significantly impacted the
system’s output, especially for smaller models, task-specific fine-tuning might increase the
models’ robustness against these changes and enhance their overall performance. In fact, recent
studies have shown that fine-tuning large-scale pre-trained models on downstream tasks can
lead to substantial improvements in task performance [35, 36].</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was partially supported by national funds through the Foundation for Science and
Technology (FCT) in the context of the project UIDB/00127/2020. Tiago Almeida is funded by
FCT under the grant 2020.05784.BD. Jorge Miguel Silva has received funding from the EC under
grant agreement 101081813, Genomic Data Infrastructure.</p>
      <p>[33] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal rank fusion outperforms Condorcet and individual rank learning methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 758–759. URL: https://doi.org/10.1145/1571941.1572114. doi:10.1145/1571941.1572114.</p>
      <p>[34] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.</p>
      <p>[35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.</p>
      <p>[36] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019).</p>
      <p>[37] U. Khandelwal, K. Clark, D. Jurafsky, L. Kaiser, Sample efficient text summarization using a single pre-trained transformer, arXiv preprint arXiv:1905.08836 (2019).</p>
      <p>[38] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, Q. Liu, ERNIE: Enhanced language representation with informative entities, arXiv preprint arXiv:1905.07129 (2019).</p>
    </sec>
    <sec id="sec-9">
      <title>A. Examples of generated answers</title>
      <p>Question: Which are the types of Charcot-Marie-Tooth Disease (CMT)?</p>
      <p>OA-LLaMA-30B: There are four main types of Charcot-Marie-Tooth Disease (CMT): CMT1, CMT2, CMT3, and CMT4.</p>
      <sec id="sec-9-1">
        <title>Question: What are Luminopsins?</title>
      </sec>
      <sec id="sec-9-2">
        <p>OA-LLaMA-30B: Luminopsins are chimeric proteins consisting of a luciferase fused to an opsin that
allow for control of neuronal activity through optogenetics.</p>
        <p>
          Pythia-12B: Luminopsins (abbreviated as LMOs or Lumisomes) are genetically encoded photoswitches that respond
to blue light by changing their absorption spectrum
from 500 nm to 600-700nm. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] They consist of a
fusion protein between luciferases such as firefly
Lucina pellucida [b] (...)
        </p>
      </sec>
      <sec id="sec-9-3">
        <title>Answer generated 2: Question id 6419cd02690f196b5100003a</title>
        <p>Question: Is SARS-CoV-2 transmitted through breast milk?</p>
        <p>OA-LLaMA-30B: No, according to this study, there is no
evidence that breast milk contains infectious SARS-COVID-2.</p>
        <p>Pythia-12B: The patient’s symptoms are consistent with
those seen in individuals with Proprotein Convertase
Subtilisin/Kexin Type 1 (PCSK1) mutations, which
result in decreased activity or lack of expression
of the protein. Specifically, these mutations lead
to a decrease in proteolytic processing of peptides
involved in nutrient transport across epithelia such
as jejunum and ileum, resulting in malabsorption of
carbohydrates, fats, and proteins. This may explain
why the patient experienced episodes of fasting
hypokalemia secondary to carbohydrate and fat
maldigestion while also experiencing diarrhea due to
poor digestion and malassimilation of
macronutrients. It should be noted that other
factors such as imbalances in electrolytes and
bicarbonate production could contribute to the
patient’s phenotype as well. Additionally, given the
rarity of the condition, it is possible that there
are undiscovered modifier genes that influence
penetrance and expressivity of PCSK1 mutations,
further complicating diagnosis and management of
affected individuals (...)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Klerings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Weinhandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Thaler</surname>
          </string-name>
          ,
          <article-title>Information overload in healthcare: too much of a good thing?</article-title>
          ,
          <source>Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen</source>
          <volume>109</volume>
          (
          <year>2015</year>
          )
          <fpage>285</fpage>
          -
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , G. Paliouras,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almirantis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          , E. Gaussier,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gallinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Artieres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zschunke</surname>
          </string-name>
          , et al.,
          <article-title>BioASQ: A challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in:
          <source>AAAI fall symposium: Information retrieval and knowledge discovery in biomedical text</source>
          , Arlington, VA: Citeseer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré-Maduell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          , G. Paliouras,
          <article-title>Overview of BioASQ 2023: The eleventh BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The Probabilistic Relevance Framework: BM25 and Beyond</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          . URL: https://www.nowpublishers.com/article/Details/INR-019. doi:10.1561/1500000019.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Document Ranking with a Pretrained Sequence-to-Sequence Model</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>708</fpage>
          -
          <lpage>718</lpage>
          . URL: https://www.aclweb.org/anthology/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          ,
          <source>ACM Transactions on Computing for Healthcare (HEALTH)</source>
          <volume>3</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vandorou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <article-title>Overview of BioASQ tasks 10a, 10b and Synergy10 in CLEF2022</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , M. Potthast (Eds.),
          <source>Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>
          , Bologna, Italy, September 5th - to - 8th,
          <year>2022</year>
          , volume
          <volume>3180</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org
          ,
          <year>2022</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>178</lpage>
          . URL: https://ceur-ws.org/Vol-3180/paper-10.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matos</surname>
          </string-name>
          ,
          <article-title>Deep Learning solutions based on fixed contextualized embeddings from PubMedBERT on BioASQ 10b and traditional IR in Synergy</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , M. Potthast (Eds.),
          <source>Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>
          , Bologna, Italy, September 5th - to - 8th,
          <year>2022</year>
          , volume
          <volume>3180</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org
          ,
          <year>2022</year>
          , pp.
          <fpage>204</fpage>
          -
          <lpage>221</lpage>
          . URL: https://ceur-ws.org/Vol-3180/paper-12.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matos</surname>
          </string-name>
          ,
          <article-title>Universal passage weighting mecanism (UPWM) in BioASQ 9b</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</source>
          , Bucharest, Romania, September 21st - to - 24th,
          <year>2021</year>
          , volume
          <volume>2936</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org
          ,
          <year>2021</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>212</lpage>
          . URL: https://ceur-ws.org/Vol-2936/paper-13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval</article-title>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/1910.10687. doi:10.48550/arXiv.1910.10687, arXiv:1910.10687 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Context-Aware Document Term Weighting for Ad-Hoc Search</article-title>
          , in:
          <source>Proceedings of The Web Conference 2020</source>
          , WWW '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>1897</fpage>
          -
          <lpage>1907</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3366423.3380258. doi:10.1145/3366423.3380258.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Condenser: a pre-training architecture for dense retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08253</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          , CoRR abs/2004.04906 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2004.04906. arXiv:2004.04906.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          , CoRR abs/2007.00808 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2007.00808. arXiv:2007.00808.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Douze,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>535</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ma,
          <article-title>Optimizing dense retrieval model training with hard negatives</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1503</fpage>
          -
          <lpage>1512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          , I. Gulrajani,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dubois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Stanford Alpaca: An instruction-following LLaMA model</article-title>
          , https://github.com/tatsu-lab/stanford_alpaca,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hallahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. S.</given-names>
            <surname>Prashanth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skowron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>van der Wal</surname>
          </string-name>
          ,
          <article-title>Pythia: A suite for analyzing large language models across training and scaling</article-title>
          ,
          <year>2023</year>
          . arXiv:2304.01373.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schuh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsvyashchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Austin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gur-Ari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Michalewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spiridonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sepassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omernick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Pillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pellat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lewkowycz</surname>
          </string-name>
          , E. Moreira,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Polozov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saeta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Firat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catasta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meier-Hellstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          , N. Fiedel,
          <article-title>PaLM: Scaling language modeling with pathways</article-title>
          ,
          <year>2022</year>
          . arXiv:2204.02311.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leahy</surname>
          </string-name>
          , S. Biderman,
          <article-title>GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow</article-title>
          ,
          <year>2021</year>
          . URL: https://doi.org/10.5281/zenodo.5297715. doi:10.5281/zenodo.5297715.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          , Y. Shen, P. Wallis,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kilcher</surname>
          </string-name>
          , D. von Rütte,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anagnostidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-R.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barhoum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Duc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Stanley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nagyfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>ES</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Glushkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dantuluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maguire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mattick</surname>
          </string-name>
          ,
          <article-title>OpenAssistant conversations - democratizing large language model alignment</article-title>
          ,
          <year>2023</year>
          . arXiv:2304.07327.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Anserini: Enabling the use of Lucene for information retrieval research</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '17, Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>1253</fpage>
          -
          <lpage>1256</lpage>
          . URL: https://doi.org/10.1145/3077136.3080721. doi:10.1145/3077136.3080721.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          , D. Chen,
          <article-title>SimCSE: Simple contrastive learning of sentence embeddings</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>6894</fpage>
          -
          <lpage>6910</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.552. doi:10.18653/v1/2021.emnlp-main.552.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Anserini: enabling the use of Lucene for information retrieval research</article-title>
          ,
          <source>in: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1253</fpage>
          -
          <lpage>1256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <article-title>Pyserini: An easy-to-use Python toolkit to support replicable IR research with sparse and dense representations</article-title>
          ,
          <source>arXiv preprint arXiv:2102.10073</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , et al.,
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
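          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,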
          <article-title>BioGPT: generative pre-trained transformer for biomedical text generation and mining</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>23</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1093/bib/bbac409. doi:10.1093/bib/bbac409.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Machine Learning, ICML'20</source>
          , JMLR.org,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buettcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms Condorcet and individual rank learning methods</article-title>
          ,
          <source>in: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>758</fpage>
          -
          <lpage>759</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>Pythia-12B: No, according to this study which analyzed over 100 women's breast milk</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>Question: Which amino acid in implicated in the Blue diaper syndrome? OA-LLaMA-30B: The Blue diaper syndrome is caused by a frameshift mutation in the Proprotein Convertase Subtilisin/Kexin Type 1 (PCSK1) gene which results in the substitution of Valine for Leucine at position 227 (p.[Val227Leufs∗12])</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>