Can Open-Source LLMs Compete with Commercial Models? Exploring the Few-Shot Performance of Current GPT Models in Biomedical Tasks
Notebook for the BioASQ Lab at CLEF 2024
Samy Ateia1, Udo Kruschwitz1
1 Information Science, University of Regensburg, Universitätsstraße 31, 93053, Regensburg, Germany
Abstract
Commercial large language models (LLMs), like OpenAI's GPT-4 powering ChatGPT and Anthropic's Claude 3 Opus, have dominated natural language processing (NLP) benchmarks across different domains. New competing open-source alternatives like Mixtral 8x7B or Llama 3 have emerged and seem to be closing the gap while often offering higher throughput and being less costly to use. Open-source LLMs can also be self-hosted, which makes them interesting for enterprise and clinical use cases where sensitive data should not be processed by third parties. We participated in the 12th BioASQ challenge, which is a retrieval augmented generation (RAG) setting, and explored the performance of the current GPT models Claude 3 Opus, GPT-3.5-turbo and Mixtral 8x7B with in-context learning (zero-shot, few-shot) and QLoRa fine-tuning. We also explored how additional relevant knowledge from Wikipedia added to the context window of the LLM might improve its performance. Mixtral 8x7B was competitive in the 10-shot setting, both with and without fine-tuning, but failed to produce usable results in the zero-shot setting. QLoRa fine-tuning and Wikipedia context did not lead to measurable performance gains. Our results indicate that the performance gap between commercial and open-source models in RAG setups exists mainly in the zero-shot setting and can be closed by simply collecting few-shot examples for domain-specific use cases. The code needed to rerun these experiments is available through GitHub⋆.
Keywords
Zero-Shot Learning, Few-Shot Learning, QLoRa fine-tuning, LLMs, BioASQ, GPT-4, RAG, Question Answering
1. Introduction
Over the course of 2023, NLP benchmarks in various domains were dominated by commercial LLMs that are only accessible via APIs, which makes it difficult to do transparent and reproducible research [1]. They also might not be usable in clinical or enterprise use cases where sensitive data cannot be shared with third parties. In March 2023, OpenAI had to briefly take ChatGPT offline because it was accidentally leaking user messages1. In April 2023, Samsung had to ban the use of ChatGPT because employees had shared sensitive data with the system2. These examples show that there are real issues with the confidentiality of these services, while no competitive offline alternatives existed in early 2023. However, some companies like Mistral3 and Meta4 have started to publish their state-of-the-art (SOTA) LLMs with permissive licenses and are making the model weights downloadable. This makes these models especially interesting for research directed at enterprise and clinical applications, as they can be hosted on one's own hardware in a controlled environment.
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
⋆ https://github.com/SamyAteia/bioasq2024
Samy.Ateia@stud.uni-regensburg.de (S. Ateia); udo.kruschwitz@ur.de (U. Kruschwitz)
https://www.uni-regensburg.de/language-literature-culture/information-science/team/samy-ateia-msc (S. Ateia); https://www.uni-regensburg.de/language-literature-culture/information-science/team/udo-kruschwitz/ (U.
Kruschwitz)
0009-0000-2622-9194 (S. Ateia); 0000-0002-5503-0341 (U. Kruschwitz)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://web.archive.org/web/20240503032019/https://openai.com/index/march-20-chatgpt-outage/
2 https://web.archive.org/web/20240518030412/https://techcrunch.com/2023/05/02/samsung-bans-use-of-generative-ai-tools-like-chatgpt-after-apr
3 https://mistral.ai/news/mixtral-of-experts/
4 https://llama.meta.com/llama3/
As enterprise use cases often have to deal with domain-specific data that is not publicly available to the LLMs during pre-training, retrieval augmented generation (RAG) [2] is often used to enable these models to understand new concepts and be more helpful and grounded in their responses [3]. Several software vendors are publishing marketing articles advertising the usefulness of their solutions for enabling RAG in enterprises5 6 7. The BioASQ challenge is a great example of a RAG setup in a specialized domain, as the participating systems first have to find relevant biomedical papers from PubMed and extract snippets that are later used to generate answers to biomedical questions. We set out to explore the usefulness and competitiveness of open-source models compared to the current SOTA commercial offerings in a typical domain-specific RAG setup, represented by the BioASQ challenge. Compared to last year's approach, where we only looked at the zero-shot performance of commercial models, we now explored few-shot learning because we saw that it enables open-source models to better follow instructions while also improving overall performance. Another aspect that we explored was how additional relevant context retrieved from Wikipedia might aid the models in generating useful answers or relevant queries, as they might be limited in their biomedical knowledge about entities and their synonyms.
1.1. BioASQ Challenge
BioASQ is "a competition on large-scale biomedical semantic indexing and question answering" [4]. It is held as a lab at the Conference and Labs of the Evaluation Forum (CLEF) conference8. The current 2024 workshop is the 12th installment of the BioASQ competition9. The 12th BioASQ challenge comprises several tasks:
• BioASQ Task Synergy On Biomedical Semantic QA For Developing Issues [5]
• BioASQ Task B On Biomedical Semantic QA [5]
• BioASQ Task MultiCardioNER On Multiple Clinical Entity Detection In Multilingual Medical Content [6]
• BioASQ Task BioNNE On Nested NER In Russian And English [7]
We participated in Task B and Synergy [5]. For Task B, the participants' systems receive a list of biomedical questions that should be answered with a short paragraph-style answer; some questions additionally require an exact answer, which can take one of 3 formats: yes/no, factoid (a list of up to 5 entities) or list (a list of up to 200 entities). Additionally, the systems first have to retrieve relevant papers from the PubMed annual baseline and extract relevant snippets from these papers that could aid in answering the questions. This retrieval subtask of Task B is called Phase A, while the actual question answering subtask is called Phase B. For Phase B, the systems also receive a set of gold snippets and documents that should help them answer the question. Task B was scheduled in 4 batches with two weeks in between and ran from March 28 to May 11.
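To make the three exact answer formats described above concrete, the following sketch shows schematically what a system's exact answers might look like for the different question types. The structure and field names are purely illustrative and do not reproduce the official BioASQ submission schema.

# Illustrative only; the official BioASQ submission format is defined by the challenge organizers.
example_exact_answers = {
    "yesno":   "yes",                           # a single "yes" or "no"
    "factoid": ["BRCA1", "BRCA2"],              # a short list of up to 5 candidate entities
    "list":    ["gene A", "gene B", "gene C"],  # a longer list of up to 200 entities
}
# Every question is additionally answered with a short paragraph-style ("ideal") answer,
# and in Phase A the system must also return the supporting PubMed documents and snippets.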
In the 12th installment, another Phase A+ was introduced, where the systems were supposed to provide answers to the questions before the gold snippets and documents were provided, relying solely on their own retrieved documents and snippets. For the Synergy task, the systems receive a similar list of questions, for which they also have to retrieve useful papers and extract snippets; as soon as a question is marked as ready to answer, they also need to submit answers in the same format as for Task B. The difference between Synergy and Task B is that no gold set of documents and snippets is provided in the initial round; instead, the documents and snippets submitted by the systems are evaluated by biomedical experts and selected as gold reference items for subsequent rounds. This also means that the same questions might be reintroduced in subsequent rounds, possibly with additional questions and positive and negative feedback on the previously submitted documents.
5 https://cohere.com/blog/five-reasons-enterprises-are-choosing-rag
6 https://www.pinecone.io/learn/retrieval-augmented-generation/
7 https://gretel.ai/blog/what-is-retrieval-augmented-generation
8 https://clef2024.clef-initiative.eu/
9 http://www.bioasq.org/
Following this introduction, we will highlight some related work in Section 2, describe our methodology in Section 3, report our results in Section 4 and discuss them in Section 5. Section 6 will present some ethical considerations, and Section 7 offers our conclusions.
2. Related Work
We will briefly introduce the related work that led to the creation of the evaluated models, as well as the approaches that inspired our methodology.
2.1. GPT Models
Nearly all the popular SOTA LLMs that are used across various NLP tasks and use cases today are based on the transformer architecture [8], with the generative pre-trained transformer (GPT) [9] being a popular variant. These models undergo pre-training on vast amounts of text by solving the next-token prediction task [9]. Afterwards, the models are fine-tuned to align with human preference data [10], which enables them to follow instructions and be useful in direct interactions with users. OpenAI was the first company to release such a fine-tuned model to the public in November 202210, which sparked massive interest in generative artificial intelligence research and products. At the time of writing, their latest model powering the ChatGPT product was GPT-4 [11]. One interesting competitor model that we also used during this competition is Claude 3 Opus11 by Anthropic, which reached GPT-4 level performance (GPT-4-0125-preview) at the time of the BioASQ competition12. The exact architecture of GPT-4, Claude 3 Opus and other commercial models is unknown. GPT-4 is the most expensive and slowest model that OpenAI offers via their API service. A more affordable alternative that they offer is GPT-3.5-turbo. We compared the performance of both these models in last year's BioASQ competition and showed that GPT-3.5-turbo sometimes performed better than GPT-4 in certain question formats and subtasks of the competition [12]. For this year's BioASQ competition we also used Mixtral 8x7B [13], a downloadable open-source model (Apache 2.0 license) that uses a Mixture-of-Experts architecture [14][15]. This architecture offers higher computational efficiency by routing requests to expert subnetworks to generate a response.
Since only some specialized experts are active during generation, less computation is needed to serve a request than the total parameter count would suggest.
2.2. Few and Zero-Shot Learning
Few-shot learning is the ability of LLMs to learn to solve a new problem that they were not specifically fine-tuned for when shown only a few examples. When GPT-3 was first introduced, its impressive few-shot learning abilities made the concept popular [16], because it greatly reduces the need for expensive training data. Zero-shot learning [17] takes this concept a step further by only requiring an abstract task description or direct question, which ideally leads the model to generate a useful completion that solves the task at hand [18]. In last year's BioASQ competition, we were able to win some batches while only using zero-shot learning with SOTA commercial models [12].
10 https://web.archive.org/web/20240502090536/https://openai.com/index/chatgpt/
11 https://web.archive.org/web/20240516173322/https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
12 https://chat.lmsys.org/?leaderboard
2.3. Adapter Fine-Tuning
Current LLMs have billions of parameters and require specialized hardware with enough GPUs and VRAM to hold all the model weights in memory. For example, "16-bit finetuning of a LLaMA 65B parameter model requires more than 780 GB of GPU memory" [19]. Since this makes fine-tuning these models prohibitively expensive for many researchers and users, several clever techniques have been invented to reduce these hardware requirements. We wanted to fine-tune Mixtral 8x7B, which roughly takes up the memory of a 47B model13. One popular approach is QLoRa by Dettmers et al. [19], where the model weights are quantized to 4 bits and frozen, and only some low-rank adapters (LoRa) [20] are fine-tuned. This would enable fine-tuning Mixtral 8x7B, for example, on only two RTX A6000 GPUs with 2x 48 GB of VRAM.
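In our experiments the QLoRa fine-tuning of Mixtral 8x7B was performed through the fireworks.ai service (see Section 3.3), so the following is only a minimal local sketch of the technique described above, assuming the Hugging Face transformers, peft and bitsandbytes libraries. The hyperparameters and target modules are illustrative and are not the settings used for our runs.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base weights to 4 bits (NF4) so the ~47B parameters fit into limited VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",  # spread the quantized weights over the available GPUs
)
model = prepare_model_for_kbit_training(model)

# Only the small low-rank adapter matrices are trained; the quantized base model stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice of modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well below 1% of the total parameters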
2.4. Retrieval Augmented Generation (RAG)
Retrieval augmented generation (RAG) is a technique [2] that combines information retrieval with language models to enhance their ability to generate relevant and factual text. In RAG, the language model is augmented with an external knowledge base or other information source, such as a collection of documents or web pages. When generating text, the model first retrieves relevant information based on the input query and then uses that information to guide the generation process. This process is applied in the BioASQ challenge, where the relevant information source is the annual baseline of PubMed. RAG has been shown to improve the factual accuracy of generated text compared to standalone language models [21]. It allows the model to access a vast amount of external knowledge and incorporate it into the generated output. RAG is particularly useful for tasks that require domain-specific knowledge or up-to-date information [3].
2.5. Professional Search
Professional search is search conducted in a professional context, often to aid in work-related research tasks [22]. In some professional search settings, highly trained specialists are needed to create documented and reproducible search strategies; this sets professional search apart from everyday web search [23]. The BioASQ challenge exemplifies one possible professional search setting, where biomedical experts aim to find answers to domain-specific questions with sufficient evidence. Other examples of professional search might be systematic reviews [24], patent search or search conducted by recruitment professionals [25]. All of these settings might require complex search strategies, where the search expert makes use of a query syntax involving boolean operators on specific search fields. Systematic reviews, for example, also require the search to be explainable and reproducible, which makes it difficult to use advanced vector-based retrieval techniques. Formulating traditional queries with large language models, which might be able to expand synonyms and related terms based on their semantic representations, is therefore an interesting approach that might aid in professional search settings. We set out to explore this approach in the BioASQ challenge.
3. Methodology
3.1. Models
In this year's BioASQ competition, we looked at the commercial offerings GPT-3.5-turbo and GPT-4 from OpenAI and also used Anthropic's Claude 3 Opus, which was, at the time of the run submissions, the only other model on a level with GPT-4 according to the LMSYS Chatbot Arena Leaderboard [26]. Since last year's BioASQ competition, some competitive open-source models have been published, the most notable ones being the Llama series models by Meta [27], with the latest being Llama 3 [28]. Llama comes with its own custom license, which is quite permissive; but even though commercial use is allowed under this license as long as the monthly user base does not exceed 700 million users, the license might not be straightforward to adopt for enterprise use cases where licenses have to be pre-approved by a legal team. When we prepared our runs for the competition, the best-performing model with a permissive open-source license (Apache 2.0) on the LMSYS leaderboard was Mixtral 8x7B [13]. The model also has a large context length of 32k tokens, which makes it especially interesting for RAG use cases or few-shot learning. We therefore chose Mixtral 8x7B as our open-source competitor model for this competition. During the competition, the newer Mixtral 8x22B model was also published, and we used it in some batches of Task B. We used the commercial hosting service fireworks.ai14 to access and fine-tune Mixtral 8x7B, as the provided speed was very high and the costs for their API usage were low.
13 https://mistral.ai/news/mixtral-of-experts/
Listing 1: Query Expansion Prompt
{"role": "user", "content": f"""Turn the following biomedical question into an effective elasticsearch query using the query_string query type by incorporating synonyms and additional terms that closely relate to the main topic and help reduce ambiguity. Focus on maintaining the query's precision and relevance to the original question, the index contains the fields 'title' and 'abstract', return valid json: '{question}' """}
3.2. Synergy
We downloaded and indexed the annual PubMed baseline from the official website15. We indexed both the title and the abstract of all papers in separate fields of our index using the built-in English analyzer of Elasticsearch16 (a sketch of such an index mapping is shown below). For every round of Synergy, the most recent snapshots of 2024 up to the date considered in that round were downloaded and indexed in another similar index, which was then also searched during the runs. For Synergy, we used both gpt-4-0125-preview and gpt-3.5-turbo-0125, the newest available versions of OpenAI's GPT-4 and GPT-3.5-turbo at the time of the competition.
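The index described above could be created roughly as follows; this is a minimal sketch using the official Python Elasticsearch client with an illustrative index name, not the exact code from our notebooks.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Title and abstract are stored as separate text fields, both using the built-in English analyzer,
# matching the field names referenced in the query expansion prompt (Listing 1).
es.indices.create(
    index="pubmed-annual-baseline",  # illustrative index name
    mappings={
        "properties": {
            "title": {"type": "text", "analyzer": "english"},
            "abstract": {"type": "text", "analyzer": "english"},
        }
    },
)

A generated query body (see Listing 2 below) can then be executed against this index, for example via es.search(index="pubmed-annual-baseline", query=generated["query"], size=generated["size"]).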
We used 2-shot learning to generate queries for our Elasticsearch PubMed index and zero-shot learning for extracting and reranking snippets as well as for answering questions. We also wanted to use Mixtral 8x7B in this task, but the model was unable to follow instructions well enough to produce usable runs, especially in the zero-shot setting. Given a question, we prepended the prompt in Listing 1 with two examples, in which the same prompt contained other questions, for example "Is CircRNA produced by back splicing of exon, intron or both, forming exon or intron circRNA?", together with an ideal completion in the form of an Elasticsearch query for the query_string endpoint; an example of such a completion can be seen in Listing 2. We sent both the examples and the prompt with our actual question to the model and received back a JSON object that could be used to directly query our index. We ran the generated query to retrieve the top 50 relevant documents from Elasticsearch. We filtered out documents that were marked as irrelevant in the feedback file for the Synergy round. We then sent each remaining article alongside the question to the model and used a zero-shot prompt to extract a list of relevant snippets from the article. We then used string matching to ensure the returned snippets were actually present in the article title or abstract and to calculate the offsets. We collected all relevant snippets from the up to 50 articles and filtered out as irrelevant those articles from which the model did not extract any snippets. We then prompted the model to select the top 10 snippets from this set, ranked by helpfulness. We finally reranked the retrieved articles according to the snippet order returned by this step.
14 https://fireworks.ai/
15 https://pubmed.ncbi.nlm.nih.gov/download/
16 https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#english-analyzer
Listing 2: Query Expansion Completion Example
{"role": "assistant", "content": """
{
  "query": {
    "query_string": {
      "query": "(CircRNA OR \"circular RNA\") \"back splicing\" exon OR intron",
      "fields": [
        "title^10",
        "abstract"
      ],
      "default_operator": "and"
    }
  },
  "size": 50
}
"""},
In the question answering step, we used the identical zero-shot prompts from last year's participation in Task B for this year's Synergy task [12]. We also merged the snippets already deemed relevant in the feedback files into the list of snippets that we passed on to the model alongside the question prompt. We also sent the same initial system prompt that we used last year [12] to the models. For the parameters, we set the temperature parameter to 0 to reduce randomness in the completion and supplied a seed parameter, a new feature offered by OpenAI that can help maximize reproducibility of the model output, although determinism is still not guaranteed17. We also used the new response_format parameter to ensure the model produced valid JSON responses for the prompts where we needed it18. The exact Python notebooks with all the implementation details and prompts used are available in our GitHub repository.
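As a minimal sketch, the parameters mentioned above are passed as follows with the official openai Python client (v1.x); the model name, seed value and prompt variables are placeholders rather than the exact values from our notebooks.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "system", "content": system_prompt},   # placeholder: same initial system prompt as last year
        *few_shot_examples,                             # placeholder: optional user/assistant example turns
        {"role": "user", "content": question_prompt},   # placeholder: the prompt for the current question
    ],
    temperature=0,                            # reduce randomness in the completion
    seed=42,                                  # best-effort reproducibility, determinism not guaranteed
    response_format={"type": "json_object"},  # only set for prompts that must return valid JSON
)
completion = response.choices[0].message.content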
3.3. Task 12 B
For Task B we reused the indexed PubMed annual snapshot that we created for the Synergy task. We also switched models: we added Mixtral 8x7B Instruct v0.1 as an open-source model, and we used Claude 3 Opus instead of GPT-4 because it became available shortly before the phase started and we had access to a free beta evaluation account. In batches 1 & 2 we also explored the fine-tuning service of OpenAI and created 6 fine-tuned versions of GPT-3.5-turbo, one for each subproblem that our system had to solve. We also created 6 QLoRa fine-tuned versions of Mixtral 8x7B Instruct v0.1 using the fine-tuning service of fireworks.ai. The training sets were created from the supplied training data. We also adjusted our code to be able to use these training files as sources for few-shot examples. The subproblems that we sampled training sets for were:
• Snippet Extraction
• Snippet Reranking & Selection
• Summary Question Answering
• Exact yes/no Question Answering
• Exact factoid Question Answering
• Exact list Question Answering
17 https://platform.openai.com/docs/api-reference/chat/create#chat-create-seed
18 https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format
In batches 3 & 4 we explored how we could use the models to retrieve relevant additional context from Wikipedia. We hypothesized that supplying these models with knowledge about relevant entities in the questions might improve their ability to generate correct answers. We wanted to explore the approach of retrieving such additional information about entities from a wiki, because a similar wiki-like knowledge base could also be built in an enterprise setting, potentially closing the knowledge gap for entities that the models didn't encounter during pre-training. We chose Wikipedia as a knowledge base, even though the concepts described there might not be novel to the models, because it was easy to use and we hoped that we could observe an effect even with known concepts. We suspected that these models might just "know" which Wikipedia articles are relevant to a question because they have likely been trained extensively on Wikipedia and on links to Wikipedia articles. The exact zero-shot prompt that was used for finding relevant Wikipedia articles can be seen in Listing 3.
Listing 3: Wikipedia Retrieval Prompt
prompt = f"""
Given the question "{question}", identify existing Wikipedia articles that offer helpful background information to answer this question. Ensure that the titles listed are of real articles on Wikipedia as of your last training cut-off. Wrap the confirmed article titles in hashtags (e.g., #Article Title#). Provide a step-by-step reasoning for your selections, ensuring relevance to the main components of the question.
Step 1: Confirm the Existence of Articles
Before listing any articles, briefly verify their existence by ensuring they are well-known topics generally covered by Wikipedia.
Step 2: List Relevant Wikipedia Articles
After confirming, list the articles, wrapping the titles in hashtags and explaining how each article is relevant to the question.
"""
The titles of the Wikipedia articles returned by this prompt were extracted and, if the Wikipedia articles actually existed, their content was retrieved and concatenated. Finally, the concatenated articles were again sent to the model, and it was prompted to produce a concise summary to help answer the question. This summary was then added as additional context to all subsequent prompts such as query generation, snippet extraction, reranking and question answering.
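A minimal sketch of this post-processing step is shown below: titles wrapped in hashtags are extracted from the model output with a regular expression and, if an article exists, its plain-text content is fetched. We assume the public MediaWiki API and the requests library here; the actual retrieval code in our notebooks may differ in detail.

import re
import requests

def extract_titles(completion: str) -> list[str]:
    """Pull the article titles wrapped in hashtags (e.g. #Article Title#) out of the model output."""
    return [t.strip() for t in re.findall(r"#([^#\n]+)#", completion) if t.strip()]

def fetch_article_text(title: str) -> str | None:
    """Fetch the plain-text extract of an English Wikipedia article, or None if it does not exist."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query", "prop": "extracts", "explaintext": 1,
            "redirects": 1, "format": "json", "titles": title,
        },
        timeout=30,
    )
    page = next(iter(response.json()["query"]["pages"].values()))
    return page.get("extract")  # pages that do not exist have no "extract" field

# `completion` is the model output produced by the prompt in Listing 3.
titles = extract_titles(completion)
wikipedia_context = "\n\n".join(text for text in map(fetch_article_text, titles) if text)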
3.3.1. Phase A
In Phase A, we changed the way we prompted the models for queries compared to the Synergy task. Instead of expecting the whole valid JSON query object back, we only prompted the models to create the query string in valid query_string syntax to pass on to the Elasticsearch endpoint, and we manually controlled the weighting of the fields and the document return size. We then used the questions from batch 1 of last year's task, which were provided in the development set, to create a set of Elasticsearch queries with Claude 3 Opus. Finally, we ran these queries and evaluated the returned documents against the gold set to select the 10 queries with the highest F1 score as few-shot examples for all models. We used 10 examples for most of our few-shot tasks because they fit into most models' context lengths, with GPT-3.5-turbo having the smallest context length of 16k tokens. The rest of our approach to retrieving relevant documents and snippets was mostly in line with our approach from the Synergy task, except that we re-added a step that we had also used last year: when a generated query did not return any results, we prompted the models for an improved query. Compared to our system from last year, we changed the query expansion/generation prompt, added the possibility to prepend few-shot examples to the prompts and used different models, some of which were also fine-tuned. We also reranked snippets instead of titles, and filtered and ranked the articles according to their extracted snippets. We also used our own Elasticsearch index of the PubMed annual baseline instead of the online PubMed search endpoint.
3.3.2. Phase A+ and Phase B
Our approach for Phase A+ and Phase B was mostly identical. We used the same prompts and few-shot examples, taken from the training sets sampled for model fine-tuning. The only difference was the relevant snippets provided as context to the models. For Phase A+ we used the same snippets as input for all models; these were taken from the run in Phase A in which the most snippets were found. We opted to use the same snippets because we wanted to be able to compare the performance of the different models in Phase A+ in isolation from their performance in Phase A. For Phase B we took the gold snippets provided in the run file as input. The prompts used for Phase A+ and Phase B were identical to the question answering prompts from the Synergy task and our last year's approach. We only added the option to add additional context before the snippets and to supply few-shot examples. In batches 1 & 2 we compared the performance of fine-tuned models with their non-fine-tuned counterparts, and in batches 3 & 4 we compared systems with additional relevant context taken from Wikipedia with systems that did not have this context.
4. Results
We participated with our systems in two tasks of the BioASQ challenge: Synergy and Task B On Biomedical Semantic QA. For the Synergy task, we report results only for batches 3 and 4, as we were unable to participate in earlier batches. For Task B we competed in all batches and report the results of sub-tasks A (Retrieval), A+ (Q&A with own retrieved documents) and B (Q&A with gold documents). The results presented in this section are only preliminary, as the manual assessment of the system responses by the BioASQ team of biomedical experts is still ongoing. The final results will be available on the BioASQ homepage once the manual assessment is finished19.
4.1. Synergy
We participated with 2 systems in batches 3 and 4 of the Synergy task. The full result table is accessible on the BioASQ website20.
The two systems both used the same 2-shot query expansion and zero-shot Q&A approach but with different commercial models. The system names and corresponding models are listed below: • UR-IW-2: gpt-4-0125-preview. • UR-IW-3: gpt-3.5-turbo-0125 We tried to also use Mixtral 8x7B Instruct v0.1 as an open-source alternative in this task, but the model was unable to follow the zero-shot prompts for query expansion and extracting the snippets consistently enough to produce a submittable run file. Our retrieval approach for query expansion and filtering and reranking via extracted snippets was not competitive in the document retrieval stage of the task, with both models performing similarly poorly compared to the top competitors, as indicated in Table 1. The official metrics to rank systems in each subtask are highlighted in bold in the following tables21 . 19 http://participants-area.bioasq.org/results/ 20 http://participants-area.bioasq.org/results/synergy_v2024/ 21 "Top Competitor" are the systems that took the first position in a round or batch that are not ours. They are added as a reference point for the reported metrics. When "Top Competitor" is missing in a reported batch, one of our systems was the best-performing one. Table 1 Task 12 Synergy, Document Retrieval, Batches 3-4 Batch Position System Precision Recall F-Measure MAP GMAP Test round 3 1 of 9 Top Competitor 0.2156 0.2568 0.1981 0.1769 0.0195 6 of 9 UR-IW-3 0.1076 0.0619 0.0675 0.0532 0.0003 7 of 9 UR-IW-2 0.1076 0.0619 0.0675 0.0532 0.0003 Test round 4 1 of 13 Top Competitor 0.1651 0.1671 0.1459 0.1308 0.0070 5 of 13 UR-IW-3 0.0912 0.0831 0.0790 0.0664 0.0006 7 of 13 UR-IW-2 0.0792 0.0746 0.0628 0.0514 0.0003 For snippet extraction, the performance of our approach was also poor, except for batch 4 where gpt-3.5-turbo-0125 was able to achieve second place as can be seen in Table 2. Gpt-4-0125-preview was unable to extract any snippets in the same batch, which was unusual and could have been due to issues OpenAI had with serving the preview model via their API. Table 2 Task 12 Synergy, Snippet Extraction, Batches 3-4 Batch Position System Precision Recall F-Measure MAP GMAP Test round 3 1 of 9 Top Competitor 0.1751 0.1444 0.1356 0.1811 0.0019 6 of 9 UR-IW-3 0.0795 0.0467 0.0474 0.0567 0.0001 7 of 9 UR-IW-2 0.0795 0.0467 0.0474 0.0567 0.0001 Test round 4 1 of 13 Top Competitor 0.1241 0.1103 0.0982 0.1003 0.0018 2 of 13 UR-IW-3 0.0966 0.0852 0.0741 0.0989 0.0005 13 of 13 UR-IW-2 - - - - - For the question-answering stage, both models achieved perfect scores in the exact yes/no answer format, see Table 3. For factoid answers, gpt-4-0125-preview was able to take first place in batch 4 and achieved higher placements than gpt-3.5-turbo-0125 over both batches, as can be seen in Table 4. A similar difference is also observable in Table 5 for the list answer results. Table 3 Task 12 Synergy, Q&A exact Yes/No, Batches 3-4 Batch Position System Accuracy F1 Yes F1 No Macro F1 Test round 3 1 of 9 UR-IW-3 1.0000 1.0000 1.0000 1.0000 1 of 9 UR-IW-2 1.0000 1.0000 1.0000 1.0000 Test round 4 1 of 13 UR-IW-3 1.0000 1.0000 1.0000 1.0000 1 of 13 UR-IW-2 1.0000 1.0000 1.0000 1.0000 Table 4 Task 12 Synergy, Q&A exact Factoid, Batches 3-4 Batch Position System Strict Acc. Lenient Acc. 
MRR Test round 3 1 of 9 Top Competitor 0.4444 0.6667 0.5556 4 of 9 UR-IW-2 0.4444 0.5556 0.5000 6 of 9 UR-IW-3 0.4444 0.4444 0.4444 Test round 4 1 of 13 UR-IW-2 0.2727 0.6364 0.4318 9 of 13 UR-IW-3 0.2727 0.2727 0.2727
Completing one run with gpt-4-0125-preview cost around $12 in API fees, while the same run with gpt-3.5-turbo-0125 was around 10 times cheaper at $1.2. Gpt-4-0125-preview was also quite slow, taking around 180 minutes to complete one run, while gpt-3.5-turbo-0125 took only a few minutes. Cost decreased significantly compared to our last year's participation, while speed increased. This enabled us to actually use GPT-4 on the snippet extraction task, while last year we were not able to complete runs with snippet extraction for GPT-4 due to time and cost constraints. We also encountered few to no API errors during our runs, except for the empty responses in batch 4 for GPT-4, while last year we often had to rerun questions because the API timed out or returned other errors.
Table 5 Task 12 Synergy, Q&A exact List, Batches 3-4 Batch Position System Mean Prec. Recall F-Measure Test round 3 1 of 9 Top Competitor 0.4210 0.3280 0.3393 7 of 9 UR-IW-2 0.2500 0.3285 0.2669 8 of 9 UR-IW-3 0.2532 0.2710 0.2459 Test round 4 1 of 13 Top Competitor 0.3452 0.2794 0.2707 4 of 13 UR-IW-2 0.2054 0.3558 0.2395 12 of 13 UR-IW-3 0.1347 0.2203 0.1455
4.2. Task 12 B Phase A
We participated with 5 systems in all 4 batches of Task 12 B Phase A. The systems either used 1- or 10-shot learning with the plain or a fine-tuned model in batches 1 & 2, or additional context retrieved from Wikipedia in batches 3 & 4. The system names and configurations are listed below.
Batches 1-2:
• UR-IW-1: Claude 3 Opus + 1-shot
• UR-IW-2: Mixtral 8x7B Instruct v0.1, QLoRa Fine-Tuned + 10-Shot
• UR-IW-3: gpt-3.5-turbo-0125 fine-tuned + 1-shot
• UR-IW-4: Mixtral 8x7B Instruct v0.1 + 10-shot
• UR-IW-5: gpt-3.5-turbo-0125 + 10-shot
Batches 3-4:
• UR-IW-1: Claude 3 Opus 1-shot + wiki
• UR-IW-2: Claude 3 Opus 1-shot
• UR-IW-3: Mixtral 8x7B Instruct v0.1 10-Shot + wiki
• UR-IW-4: Mixtral 8x7B Instruct v0.1 10-Shot
• UR-IW-5: Mixtral 8x22B Instruct v0.1 10-Shot + wiki
The following Tables 6 and 7 show the results of our systems participating in the 4 batches. MAP was the official metric to compare the systems. In batches 1 & 2, where we compared fine-tuned versions of GPT-3.5-turbo and Mixtral 8x7B with 10-shot learning and Claude 3 Opus, no clear trend was observable over batches in the document retrieval stage of Phase A, as can be seen in Table 6. Only Mixtral 8x7B with 10-shot learning was consistently performing worse than all our other models, while Claude 3 Opus with 1-shot learning was our best model in batch 2 and second best in batch 1. We used 1-shot learning instead of 10-shot learning for Claude 3 Opus due to time constraints because the model was slow, and for the fine-tuned gpt-3.5-turbo due to cost constraints: sending 10 example abstracts for snippet extraction for each of the 50 highest-ranked search results would have amounted to a very large number of input tokens per run for these models. For batches 3 & 4, where we explored whether giving the systems additional Wikipedia context while creating queries for Elasticsearch could improve their performance, we also could not observe a consistent effect over batches. While in batch 3 the systems with additional Wikipedia context performed better, this effect was reversed in batch 4.
In the snippet extraction stage of Phase A, our QLoRa fine-tuned version of Mixtral 8x7B was our worst-performing system in both batches 1 & 2, followed by the 10-shot Mixtral 8x7B version, as can be seen in Table 7. The fine-tuned version of GPT-3.5-turbo was consistently ahead of its 10-shot counterpart in both batches, while Claude 3 Opus was worse than the GPT-3.5-turbo systems in batch 1 and better in batch 2.
Table 6 Task 12B Phase A, Document Retrieval Batch Position System Precision Recall F-Measure MAP GMAP Test Batch 1 1 of 40 Top Competitor 0.1039 0.3124 0.1485 0.2067 0.0016 25 of 40 UR-IW-5 0.0525 0.1093 0.0602 0.0811 0.0001 26 of 40 UR-IW-1 0.0784 0.1525 0.0938 0.0751 0.0002 29 of 40 UR-IW-3 0.0539 0.1023 0.0648 0.0631 0.0001 30 of 40 UR-IW-2 0.0544 0.0975 0.0551 0.0600 0.0001 32 of 40 UR-IW-4 0.0566 0.0861 0.0477 0.0511 0.0001 Test Batch 2 1 of 53 Top Competitor 0.0953 0.3673 0.1428 0.2293 0.0026 32 of 53 UR-IW-1 0.0889 0.1607 0.0971 0.0875 0.0002 39 of 53 UR-IW-2 0.0502 0.1227 0.0583 0.0657 0.0001 40 of 53 UR-IW-3 0.0542 0.1390 0.0716 0.0643 0.0001 43 of 53 UR-IW-5 0.0633 0.1048 0.0660 0.0564 0.0001 45 of 53 UR-IW-4 0.0694 0.0589 0.0517 0.0409 0.0000 Test Batch 3 1 of 58 Top Competitor 0.0859 0.3835 0.1309 0.2549 0.0024 27 of 58 UR-IW-1 0.0524 0.1761 0.0720 0.1281 0.0002 31 of 58 UR-IW-2 0.0541 0.1569 0.0734 0.1217 0.0001 40 of 58 UR-IW-3 0.0687 0.1730 0.0766 0.0971 0.0002 42 of 58 UR-IW-5 0.0664 0.1866 0.0854 0.0957 0.0003 52 of 58 UR-IW-4 0.0446 0.0859 0.0492 0.0480 0.0001 Test Batch 4 1 of 49 Top Competitor 0.1000 0.5569 0.1609 0.3930 0.0148 17 of 49 UR-IW-2 0.1199 0.3769 0.1586 0.2910 0.0018 22 of 49 UR-IW-1 0.0952 0.2810 0.1253 0.1892 0.0006 23 of 49 UR-IW-4 0.0934 0.2686 0.1224 0.1819 0.0005 25 of 49 UR-IW-5 0.0870 0.2231 0.1099 0.1617 0.0003 35 of 49 UR-IW-3 0.0681 0.1861 0.0844 0.1281 0.0001
Table 7 Task 12B Phase A, Snippet Extraction Batch Position System Precision Recall F-Measure MAP GMAP Test Batch 1 1 of 40 Top Competitor 0.0446 0.1490 0.0638 0.1149 0.0001 7 of 40 UR-IW-3 0.0454 0.0539 0.0458 0.0452 0.0001 10 of 40 UR-IW-5 0.0450 0.0546 0.0441 0.0412 0.0000 11 of 40 UR-IW-1 0.0480 0.0720 0.0508 0.0357 0.0001 12 of 40 UR-IW-4 0.0444 0.0527 0.0336 0.0244 0.0001 13 of 40 UR-IW-2 0.0341 0.0483 0.0276 0.0237 0.0000 Test Batch 2 1 of 53 Top Competitor 0.0520 0.1810 0.0746 0.1539 0.0003 6 of 53 UR-IW-1 0.0568 0.0850 0.0532 0.0569 0.0001 11 of 53 UR-IW-3 0.0400 0.0722 0.0474 0.0345 0.0001 13 of 53 UR-IW-5 0.0357 0.0474 0.0333 0.0301 0.0000 17 of 53 UR-IW-4 0.0590 0.0449 0.0334 0.0230 0.0000 18 of 53 UR-IW-2 0.0329 0.0713 0.0278 0.0191 0.0000 Test Batch 3 1 of 58 Top Competitor 0.0666 0.2568 0.0940 0.2224 0.0009 6 of 58 UR-IW-1 0.0379 0.1251 0.0508 0.0818 0.0002 7 of 58 UR-IW-5 0.0399 0.1188 0.0506 0.0736 0.0002 8 of 58 UR-IW-2 0.0359 0.0881 0.0456 0.0677 0.0001 20 of 58 UR-IW-3 0.0320 0.0819 0.0338 0.0402 0.0001 21 of 58 UR-IW-4 0.0316 0.0388 0.0279 0.0381 0.0000 Test Batch 4 1 of 49 Top Competitor 0.0782 0.4162 0.1191 0.3437 0.0043 6 of 49 UR-IW-2 0.0777 0.1846 0.0888 0.1402 0.0008 8 of 49 UR-IW-1 0.0502 0.1398 0.0645 0.0959 0.0002 10 of 49 UR-IW-4 0.0559 0.0848 0.0566 0.0661 0.0001 11 of 49 UR-IW-5 0.0586 0.1208 0.0654 0.0617 0.0001 14 of 49 UR-IW-3 0.0400 0.0486 0.0329 0.0428 0.0000
Additional Wikipedia context did not lead to consistent results across batches. While the systems with Wikipedia context (UR-IW-1, UR-IW-3) performed better than their counterparts (UR-IW-2, UR-IW-4) in batch 3, this effect was again reversed in batch 4.
One run with Claude 3 Opus with 1-shot learning and additional Wikipedia context took around 140 minutes to complete, we did not have to pay for the tokens used because we had an early beta evaluation account. For Mixtral 8x7B with 10-shot learning and additional Wikipedia context, the runs took around 14 minutes to complete via the fireworks.ai API and cost around $ 11. Fireworks.ai charged $0.50 /1M tokens for both input and output tokens as of writing, while Anthropic would have charged $ 15 /1M tokens for input and $ 75 /1M tokens for output. So the cost for doing 10-shot learning with Claude 3 Opus would have been at least 30 times as high while being 10 times slower. 4.3. Task 12B Phase A+ We participated with 5 systems in nearly all 4 batches of Task 12 B Phase A+22 . The systems either used 10-shot learning with the plain or a fine-tuned model in batches 1+2 or additional context retrieved from Wikipedia in batches 3-4. Per batch, we used the same input snippet file for all systems to base their answers on to ensure that their performance is comparable. Batches 1-2: • UR-IW-1: Claude 3 Opus + 10-shot • UR-IW-2: Mixtral 8x7B Instruct v0.1, QLoRa Fine-Tuned + 10-Shot • UR-IW-3: gpt-3.5-turbo-0125 fine-tuned + 10-shot • UR-IW-4: Mixtral 8x7B Instruct v0.1 + 10-shot • UR-IW-5: gpt-3.5-turbo-0125 + 10-shot Batches 3-4: • UR-IW-1: Claude 3 Opus 10-shot + wiki • UR-IW-2: Mixtral 8x22B Instruct v0.1 10-Shot • UR-IW-3: Mixtral 8x7B Instruct v0.1 10-Shot + wiki • UR-IW-4: Mixtral 8x7B Instruct v0.1 10-Shot • UR-IW-5: Mixtral 8x22B Instruct v0.1 10-Shot + wiki Table 8 Task 12B Phase A+, exact questions Yes/No Batch Position System Accuracy F1 Yes F1 No Macro F1 Test Batch 1 1 of 22 UR-IW-3 0.9200 0.9333 0.9000 0.9167 4 of 22 UR-IW-4 0.8400 0.8462 0.8333 0.8397 5 of 22 UR-IW-2 0.8400 0.8462 0.8333 0.8397 6 of 22 UR-IW-5 0.8000 0.8148 0.7826 0.7987 8 of 22 UR-IW-1 0.8000 0.8276 0.7619 0.7947 Test Batch 2 1 of 26 Top Competitor 0.9615 0.9677 0.9524 0.9601 2 of 26 UR-IW-5 0.8846 0.8966 0.8696 0.8831 3 of 26 UR-IW-3 0.8846 0.8966 0.8696 0.8831 6 of 26 UR-IW-4 0.8462 0.8571 0.8333 0.8452 7 of 26 UR-IW-2 0.8462 0.8571 0.8333 0.8452 12 of 26 UR-IW-1 0.7692 0.8000 0.7273 0.7636 Test Batch 3 1 of 28 UR-IW-5 0.9167 0.9286 0.9000 0.9143 8 of 28 UR-IW-2 0.8333 0.8462 0.8182 0.8322 10 of 28 UR-IW-1 0.8333 0.8667 0.7778 0.8222 11 of 28 UR-IW-3 0.7917 0.8000 0.7826 0.7913 13 of 28 UR-IW-4 0.7917 0.8148 0.7619 0.7884 Test Batch 4 1 of 29 Top Competitor 0.8889 0.9189 0.8235 0.8712 3 of 29 UR-IW-1 0.8519 0.8947 0.7500 0.8224 4 of 29 UR-IW-2 0.8519 0.8947 0.7500 0.8224 9 of 29 UR-IW-4 0.7778 0.8333 0.6667 0.7500 11 of 29 UR-IW-5 0.7407 0.8000 0.6316 0.7158 14 of 29 UR-IW-3 0.7037 0.7647 0.6000 0.6824 22 We failed to submit one system run for system number 5 in batch 3. For the yes/no exact answer format, Claude 3 Opus with 10-shot learning was our worst-performing system, while the fine-tuned version of GPT-3.5-turbo was our top-performing system in batch 1 and only beaten by its 10-shot counter-part in batch 2, as can be seen in Table 8. It was interesting to see that the open-source models could perform better than the presumably most advanced commercial model, Claude 3 Opus, in this task. For batches 3 & 4, we could show that additional Wikipedia context led to inconsistent results. While this context improved performance in batch 3 for the Wikipedia enhanced systems (UR-IW-5, UR-IW-3) over their normal 10-shot counterparts (UR-IW-2, UR-IW-4) it again led to worse performance in batch 4. 
This result is in line with the results from Phase A where these systems performed similarly for document retrieval and snippet extraction. We speculate that the models are sensitive to the Wikipedia context, and the usefulness of the context is highly influenced by both the entities present in the questions, and its relationship to the relevant snippets. Table 9 Task 12 B, Phase A+, exact questions factoid Batch Position System Strict Acc. Lenient Acc. MRR Test Batch 1 1 of 22 Top Competitor 0.2381 0.5238 0.3611 4 of 22 UR-IW-1 0.1905 0.2381 0.2143 10 of 22 UR-IW-5 0.0952 0.0952 0.0952 12 of 22 UR-IW-2 0.0952 0.0952 0.0952 13 of 22 UR-IW-3 0.0952 0.0952 0.0952 14 of 22 UR-IW-4 0.0476 0.0952 0.0714 Test Batch 2 1 of 26 Top Competitor 0.3684 0.4211 0.3947 3 of 26 UR-IW-5 0.3158 0.3158 0.3158 4 of 26 UR-IW-3 0.3158 0.3158 0.3158 7 of 26 UR-IW-2 0.2632 0.3158 0.2895 9 of 26 UR-IW-1 0.2632 0.2632 0.2632 14 of 26 UR-IW-4 0.1579 0.2105 0.1842 Test Batch 3 1 of 28 Top Competitor 0.2692 0.4231 0.3301 8 of 28 UR-IW-1 0.1923 0.3077 0.2340 10 of 28 UR-IW-4 0.1923 0.2308 0.2019 11 of 28 UR-IW-2 0.1538 0.1923 0.1731 16 of 28 UR-IW-3 0.1538 0.1538 0.1538 17 of 28 UR-IW-5 0.1538 0.1538 0.1538 Test Batch 4 1 of 29 Top Competitor 0.3684 0.4211 0.3947 2 of 29 UR-IW-1 0.3158 0.4737 0.3816 3 of 29 UR-IW-5 0.3684 0.3684 0.3684 4 of 29 UR-IW-2 0.3158 0.3684 0.3421 8 of 29 UR-IW-3 0.2105 0.3158 0.2412 10 of 29 UR-IW-4 0.1579 0.2632 0.2018 In the exact answer factoid format, our worst-performing system in batches 1 & 2 was consistently Mixtral 8x7B with 10-shot learning while its fine-tuned counterpart performed better as can be seen in Table 9. This order was reversed for GPT-3.5-turbo, where the fine-tuned version performed worse than its counterpart. The additional Wikipedia context again led to inconsistent results across batches, but this time the systems with Wikipedia context performed better in batch 4 compared to batch 3, which is contrary to the observed behavior in the document retrieval and snippet extraction in Phase A as well as the yes/no answer format in Phase A+. For the list exact answer format, Mixtral 8x7B was again our worst-performing system in batch 1 & 2, while its fine-tuned counterpart was competing with the fine-tuned version of gpt-3.5-turbo for the top positions as can be seen in Table 10. In batches 3 & 4 Claude 3 Opus with 10-shot learning and additional Wikipedia context was the best- performing system in both batches while Mixtral 8x7B with 10-shot learning and additional Wikipedia context was the worst-performing one. Overall no consistent effect of the Wikipedia context was observable across models and batches. Table 10 Task 12 B, Phase A+, exact questions list Batch Position System Mean Prec. 
Recall F-Measure Test Batch 1 1 of 22 UR-IW-2 0.5250 0.4914 0.4808 3 of 22 UR-IW-3 0.4016 0.4778 0.4089 4 of 22 UR-IW-5 0.4119 0.4182 0.3976 5 of 22 UR-IW-4 0.3948 0.4063 0.3798 7 of 22 UR-IW-1 0.3224 0.4273 0.3418 Test Batch 2 1 of 26 Top Competitor 0.4470 0.4451 0.4088 7 of 26 UR-IW-3 0.2625 0.2400 0.2411 8 of 26 UR-IW-2 0.2045 0.2569 0.2182 9 of 26 UR-IW-4 0.2628 0.2299 0.2179 12 of 26 UR-IW-1 0.1953 0.1906 0.1766 14 of 26 UR-IW-5 0.1589 0.1725 0.1497 Test Batch 3 1 of 28 Top Competitor 0.3750 0.4069 0.3708 4 of 28 UR-IW-1 0.2657 0.4232 0.3000 10 of 28 UR-IW-5 0.2208 0.2881 0.2392 12 of 28 UR-IW-2 0.2125 0.2892 0.2303 13 of 28 UR-IW-4 0.2014 0.2655 0.2186 19 of 28 UR-IW-3 0.1373 0.2326 0.1627 Test Batch 4 1 of 29 Top Competitor 0.3139 0.3433 0.3219 8 of 29 UR-IW-1 0.1529 0.2641 0.1774 14 of 29 UR-IW-4 0.1364 0.1845 0.1418 15 of 29 UR-IW-2 0.1161 0.2069 0.1366 17 of 29 UR-IW-5 0.1125 0.1610 0.1269 18 of 29 UR-IW-3 0.1191 0.1239 0.1155 4.4. Task 12B Phase B We participated with 5 systems in all 4 batches of Task 12B Phase B. The systems used 10-shot learning with the plain or a fine-tuned model in batches 1 & 2 or additional context retrieved from Wikipedia in batches 3 & 4. Batches 1-2: • UR-IW-1: Claude 3 Opus + 10-shot • UR-IW-2: Mixtral 8x7B Instruct v0.1, QLoRa Fine-Tuned + 10-Shot • UR-IW-3: gpt-3.5-turbo-0125 fine-tuned + 10-shot • UR-IW-4: Mixtral 8x7B Instruct v0.1 + 10-shot • UR-IW-5: gpt-3.5-turbo-0125 + 10-shot Batch 3 • UR-IW-1: Claude 3 Opus 10-shot + wiki • UR-IW-2: Mixtral 8x22B Instruct v0.1 10-Shot • UR-IW-3: Mixtral 8x7B Instruct v0.1 10-Shot + wiki • UR-IW-4: Mixtral 8x7B Instruct v0.1 10-Shot • UR-IW-5: Mixtral 8x22B Instruct v0.1 10-Shot + wiki Batch 4 • UR-IW-1: Claude 3 Opus 10-shot + wiki • UR-IW-2: Claude 3 Opus 10-shot • UR-IW-3: Mixtral 8x7B Instruct v0.1 10-Shot + wiki • UR-IW-4: Mixtral 8x7B Instruct v0.1 10-Shot • UR-IW-5: Mixtral 8x22B Instruct v0.1 10-Shot + wiki Table 11 Task 12 B, Phase B, exact Yes/No Batch Position System Accuracy F1 Yes F1 No Macro F1 Test Batch 1 1 of 39 UR-IW-1 0.9600 0.9655 0.9524 0.9589 2 of 39 UR-IW-5 0.9600 0.9655 0.9524 0.9589 5 of 39 UR-IW-2 0.9200 0.9231 0.9167 0.9199 6 of 39 UR-IW-4 0.9200 0.9286 0.9091 0.9188 7 of 39 UR-IW-3 0.9200 0.9286 0.9091 0.9188 Test Batch 2 1 of 43 UR-IW-3 0.9615 0.9677 0.9524 0.9601 4 of 43 UR-IW-1 0.9615 0.9697 0.9474 0.9585 8 of 43 UR-IW-2 0.9231 0.9375 0.9000 0.9188 9 of 43 UR-IW-5 0.9231 0.9375 0.9000 0.9188 26 of 43 UR-IW-4 0.8462 0.8667 0.8182 0.8424 Test Batch 3 1 of 48 Top Competitor 1.0000 1.0000 1.0000 1.0000 17 of 48 UR-IW-1 0.9167 0.9286 0.9000 0.9143 23 of 48 UR-IW-2 0.8750 0.8800 0.8696 0.8748 26 of 48 UR-IW-3 0.8750 0.8889 0.8571 0.8730 27 of 48 UR-IW-4 0.8750 0.8889 0.8571 0.8730 Test Batch 4 1 of 49 Top Competitor 0.9630 0.9730 0.9412 0.9571 8 of 49 UR-IW-1 0.9259 0.9444 0.8889 0.9167 19 of 49 UR-IW-2 0.8889 0.9231 0.8000 0.8615 20 of 49 UR-IW-4 0.8519 0.8889 0.7778 0.8333 25 of 49 UR-IW-5 0.8148 0.8571 0.7368 0.7970 31 of 49 UR-IW-3 0.5926 0.5926 0.5926 0.5926 In the exact yes/no answer settings of Phase B, Claude 3 Opus with 10-shot learning and gpt-3.5 turbo with 10-shot learning were sharing first place in batch 1 while the fine-tuned version of gpt-3.5-turbo was the best-performing system in batch 2 as can be seen in Table 11. The fine-tuned version of Mixtral 8x7B was also competitive, taking the 5th position in batch 1 and 8th position in batch 2. 
In batches 3 & 4, the systems with additional Wikipedia context performed better than their counterparts in batch 3, while the results were mixed in batch 4, again leading to inconsistent results.
Table 12 Task 12B, Phase B, exact factoid Batch Position System Strict Acc. Lenient Acc. MRR Test Batch 1 1 of 39 Top Competitor 0.4286 0.4286 0.4286 11 of 39 UR-IW-1 0.1905 0.3333 0.2540 12 of 39 UR-IW-5 0.2381 0.2857 0.2540 14 of 39 UR-IW-2 0.2381 0.2381 0.2381 15 of 39 UR-IW-3 0.2381 0.2381 0.2381 23 of 39 UR-IW-4 0.1905 0.1905 0.1905 Test Batch 2 1 of 43 UR-IW-1 0.6316 0.7368 0.6842 2 of 43 UR-IW-2 0.6842 0.6842 0.6842 3 of 43 UR-IW-4 0.6316 0.6316 0.6316 8 of 43 UR-IW-3 0.5263 0.5263 0.5263 14 of 43 UR-IW-5 0.4211 0.4211 0.4211 Test Batch 3 1 of 48 Top Competitor 0.5000 0.5000 0.5000 7 of 48 UR-IW-2 0.3846 0.3846 0.3846 8 of 48 UR-IW-3 0.3462 0.4231 0.3846 13 of 48 UR-IW-4 0.3462 0.3846 0.3654 26 of 48 UR-IW-1 0.2692 0.3077 0.2885 Test Batch 4 1 of 49 UR-IW-2 0.6316 0.6842 0.6579 5 of 49 UR-IW-5 0.5789 0.5789 0.5789 12 of 49 UR-IW-1 0.4737 0.6316 0.5439 15 of 49 UR-IW-4 0.4737 0.5789 0.5175 27 of 49 UR-IW-3 0.3684 0.3684 0.3684
For the exact factoid answer format in Phase B, Claude 3 Opus and our fine-tuned Mixtral 8x7B model shared first place in batch 2, while in batch 1 Claude 3 Opus and gpt-3.5-turbo were on the same level, as can be seen in Table 12. For Mixtral 8x7B, additional Wikipedia context improved the outcome in batch 3 but led to worse results in batch 4, while a similar effect was observable for Claude 3 Opus in batch 423.
Table 13 Task 12 B, Phase B, exact List Batch Position System Mean Prec. Recall F-Measure Test Batch 1 1 of 39 Top Competitor 0.6647 0.5804 0.5843 3 of 39 UR-IW-5 0.6054 0.5942 0.5790 5 of 39 UR-IW-3 0.6010 0.5799 0.5656 13 of 39 UR-IW-2 0.5202 0.4947 0.4992 18 of 39 UR-IW-1 0.4840 0.5069 0.4662 22 of 39 UR-IW-4 0.4563 0.3903 0.4015 Test Batch 2 1 of 43 UR-IW-4 0.5863 0.5645 0.5708 2 of 43 UR-IW-2 0.5835 0.5645 0.5698 4 of 43 UR-IW-3 0.5650 0.5347 0.5434 8 of 43 UR-IW-1 0.5061 0.5246 0.5047 9 of 43 UR-IW-5 0.5009 0.5347 0.5033 Test Batch 3 1 of 48 Top Competitor 0.6466 0.6560 0.6484 3 of 48 UR-IW-4 0.5656 0.5696 0.5611 8 of 48 UR-IW-2 0.5031 0.5367 0.5093 15 of 48 UR-IW-3 0.4451 0.4578 0.4473 20 of 48 UR-IW-1 0.3476 0.6010 0.4118 Test Batch 4 1 of 49 Top Competitor 0.7680 0.6266 0.6637 7 of 49 UR-IW-2 0.6209 0.6612 0.6299 16 of 49 UR-IW-5 0.5097 0.5044 0.4989 18 of 49 UR-IW-4 0.4919 0.4839 0.4699 19 of 49 UR-IW-3 0.4641 0.5067 0.4667 22 of 49 UR-IW-1 0.3634 0.5532 0.4260
For the exact list answer format in Phase B, Mixtral 8x7B with 10-shot learning took first place in batch 2 while ranking 22nd of 39 in batch 1, where gpt-3.5-turbo with 10-shot learning was our best-performing system, as can be seen in Table 13. For batches 3 & 4, both Claude 3 Opus and Mixtral 8x7B performed worse with additional Wikipedia context across batches. In batch 3 our best-performing system was Mixtral 8x7B with 10-shot learning and in batch 4 it was Claude 3 Opus with 10-shot learning. The costs for completing runs in Phase B were lower than in Phase A and the runs were faster, because we did not have to do snippet extraction for 50 documents per question times the number of few-shot examples; we were therefore also able to do 10-shot learning with Claude 3 Opus and the fine-tuned version of GPT-3.5-turbo.
The processing with Mixtral 8x7B via the Fireworks.ai API only took around 30 seconds for plain 10-shot examples and around 2 minutes for 10-shot examples and additional Wikipedia context. We also submitted ideal answers for Task B and A+, but do not report on the preliminary results here, as the official judging metric for this answer type is based on the manual judgements that are not available yet.
23 The run of Mixtral 8x22B with Wikipedia context was not successfully submitted to batch 3; we might have forgotten to upload it.
5. Discussion and Future Work
While testing both commercial and open-source models, we observed that no model clearly dominated across batches or sub-tasks. Even our presumably weakest model, Mixtral 8x7B Instruct v0.1 with 10-shot learning, was able to secure leading spots in some batches of the competition, beating all other competing systems (see batch 3 in Table 8 and batch 2 in Table 13). We speculate that both the RAG setting and 10-shot learning might level the playing field a bit between commercial and open-source models, and this indicates that there is clear potential for creating state-of-the-art systems even with cheaper, faster and presumably smaller open-source models, if they are used in the right way. While the Mixtral model weights are publicly available, their training data is not published, which makes these models less than ideal candidates for scientific research. A truly open-source LLM alternative is OLMo [29], published by the Allen Institute for AI. We chose Mixtral nevertheless because we wanted to study models that might be used by commercial practitioners in clinical or enterprise use cases, and we think the permissive license combined with the seemingly competitive performance on public benchmarks and its large context length makes it an ideal candidate for these use cases. From our experiments with additional Wikipedia context in BioASQ, we could see that it had an impact on performance, but the effect was inconsistent across question batches. For some questions in some subtasks it led to improvements, while for others the performance declined. We speculate that this might depend on the relevant entities in the questions and the quality of the retrieved Wikipedia context. Further experiments are needed to analyze the impact of this additional context. We also speculate that Wikipedia might not be a good proxy knowledge base for doing domain-specific RAG with these models, because they are probably already highly trained on Wikipedia data and therefore the additional knowledge from this source might not tell the models much that they do not already know. Another reason for the inconsistent results with Wikipedia context could be that we only prepended the context to the last prompt, and the preceding n-shot examples were not generated taking this context into account. Regarding fine-tuning, we had the impression that the commercial offering from OpenAI was not worth the cost. Even though it led to top results in some batches (batch 2 in Table 11), it also produced models with worse results than their significantly cheaper non-fine-tuned counterparts in others. With the right training set and the right training run one might get a consistently superior model, but then the engineering cost, compared to simple 10-shot learning, has to be added to the already more expensive usage and fine-tuning costs. We had a similar impression regarding the QLoRa fine-tuning that we explored for Mixtral 8x7B.
For example, the Mixtral model fine-tuned for list question answering performed better than all other systems in batch 1 of Phase A+ (see Table 10) but worse than its non-fine-tuned counterpart in batch 2 of Phase B (see Table 13). Overall, adapter fine-tuning appears not to be straightforward and requires more time for dataset creation, training and testing than simple few-shot learning. A more promising research direction might be selecting optimal few-shot examples for a given task. It is important to note that most of the results we presented here are preliminary and might change when the manual assessment of the system responses is completed by the BioASQ experts. However, we expect that the yes/no results will stay the same and that the factoid and list results will change only slightly. The preliminary performance of our systems in the document retrieval stage (Phase A) was quite poor compared to the other systems. We speculate that our approach of relying on TF-IDF-based retrieval and adding a richer semantic representation to the keyword query, instead of using embeddings and vector search to add such information, is not in line with the baseline system used to create the preliminary gold set. If that is true, it might be possible that our retrieval performance is actually better than we expect and that it improves when the final results are out. But it could also just be that the approach is inferior. The good performance of our systems in Phase A+, where the questions had to be answered without gold snippets, might indicate that our retrieved snippets are not as useless as the preliminary results from Phase A suggest. For future work, we would like to further explore optimal few-shot example selection [30][31][32], as few-shot learning seems to offer the best flexibility and requires less engineering effort compared to fine-tuning while being transferable between models. We also would like to revisit our knowledge base context augmentation approach beyond the BioASQ challenge. We think that on a more technical test set, where the relevant knowledge is highly unlikely to be present in the pre-training of these models, this approach could have a bigger impact. We would also need to use a different knowledge source than the English Wikipedia, which is part of most pre-training datasets [33].
6. Ethical Considerations
The current generation of LLMs still exhibits the phenomenon of so-called hallucinations [34]; that is, they sometimes make up factually incorrect statements and even harmful misinformation. LLMs might also reproduce myths or misinformation that they encountered during their open-domain training or that might have been added to their input context. A recent prominent case was Google's new AI Overview feature suggesting users should add glue to their pizza24. The hallucination and misinformation problems seem fundamental and difficult to solve, as they have been known for quite some time now and even Google, one of the most experienced AI research companies, was unable to save itself from repeated embarrassment. Even though RAG has been shown to reduce hallucinations in some settings [21], occasional hallucinations might still happen, which could be especially problematic in biomedical use cases and might warrant additional manual fact checking [35] before using the output of LLM-based systems in downstream tasks [36]. Another issue to consider is data privacy. These models might repeat their training data if prompted in a specific way [37].
6. Ethical Considerations

The current generation of LLMs still exhibits the phenomenon of so-called hallucinations [34], that is, they sometimes make up factually incorrect statements and even harmful misinformation. LLMs might also reproduce myths or misinformation that they encountered during their open-domain training or that were added to their input context. A recent prominent case was Google's new AI overview feature suggesting users should add glue to their pizza24. The hallucination and misinformation problems seem fundamental and difficult to solve, as they have been known for quite some time and even Google, one of the most experienced AI research companies, was unable to save itself from repeated embarrassment.

Even though RAG has been shown to reduce hallucinations in some settings [21], occasional hallucinations might still happen, which could be especially problematic in biomedical use cases and might warrant additional manual fact checking [35] before the output of LLM-based systems is used in downstream tasks [36].

Another issue to consider is data privacy. These models might repeat their training data if prompted in a specific way [37]. That means training data has to be carefully anonymized before training or fine-tuning these models. The same problem arises for the few-shot examples: personal data should be removed from all context that these models might repeat. They might also simply make up facts about people, which could put vendors and service providers at legal risk for defamation25.

Another big ethical issue is job replacement and automation. Klarna, a financial service provider and one of the early enterprise customers of OpenAI, published a report in February 2024 stating that their AI-powered customer support assistant handled two thirds of their customer service chats, "doing the equivalent work of 700 full-time agents"26. This automation trend could not only lead to societal issues if more and more people are made redundant by LLM-powered systems, but also to quality issues when humans are taken out of the loop and more and more users and companies trust LLM-generated content without double-checking it27,28.

24 https://web.archive.org/web/20240529100801/https://www.theverge.com/2024/5/23/24162896/google-ai-overview-hallucinations-glue-in-pizza
25 https://www.reuters.com/technology/australian-mayor-readies-worlds-first-defamation-lawsuit-over-chatgpt-content-2023-04-05/
26 https://web.archive.org/web/20240305093659/https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/
27 https://web.archive.org/web/20240306115841/https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/
28 https://web.archive.org/web/20240304162744/https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-

7. Conclusion

We showed that a downloadable open-source model (Mixtral 8x7B Instruct v0.1) was competitive with some of the best available commercial models in a domain-specific biomedical RAG setting when used with 10-shot learning. This opens up the possibility of achieving state-of-the-art performance in use cases where using third-party APIs is not feasible because of the confidentiality of the data. The model, used via a commercial hosting service, was also significantly faster than Claude 3 Opus while being at least 30x cheaper.

We also observed that the zero-shot performance of this model still lags behind its commercial competitors, making it unusable in some settings where highly specific structured output is required. We were unable to achieve consistent performance improvements from QLoRa fine-tuning Mixtral or from fine-tuning gpt-3.5-turbo via the proprietary fine-tuning service of OpenAI. This might be an indication that successfully fine-tuning these LLMs requires more engineering effort and higher costs, and might not be worthwhile in some use cases compared to few-shot learning.

We tried to augment the context of these models with additional relevant knowledge from a knowledge base (Wikipedia), but again could not see consistent performance improvements. We speculate that this might be because the knowledge in this setup was not novel enough for the models, or because of the way we combined it with few-shot learning.
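To make this combination concrete, the sketch below is a simplified illustration (not our exact implementation) of the prompt layout we refer to: the retrieved Wikipedia context is prepended only to the final user message, while the preceding 10-shot example turns remain plain question/answer pairs. Function and variable names are illustrative.

```python
# Simplified illustration of combining Wikipedia context with 10-shot examples:
# the context is prepended only to the final user message, so the example turns
# themselves never see any retrieved context.
def build_messages(system_prompt, few_shot_pairs, wikipedia_context, question):
    messages = [{"role": "system", "content": system_prompt}]
    for example_question, example_answer in few_shot_pairs:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    final_prompt = f"Context:\n{wikipedia_context}\n\nQuestion: {question}"
    messages.append({"role": "user", "content": final_prompt})
    return messages
```

Generating the example answers with their own retrieved context, so that the demonstrations match the format of the final prompt, is one variation we would like to test.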
For future work we want to verify our results on different, more domain-specific tasks where less of the relevant knowledge might have been present during the pre-training of these LLMs, and we also want to further explore the optimal selection of few-shot examples, as this seems to get the best performance out of these models while remaining methodologically straightforward.

Acknowledgments

We want to thank the organizers of the BioASQ challenge for setting up this challenge and for supporting us during our participation. We are also grateful for the feedback and recommendations of the anonymous reviewers.

References

[1] H. Chen, F. Jiao, X. Li, C. Qin, M. Ravaut, R. Zhao, C. Xiong, S. Joty, ChatGPT’s one-year anniversary: Are open-source large language models catching up?, 2024. arXiv:2311.16989.
[2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[3] A. Balaguer, V. Benara, R. L. de Freitas Cunha, R. de M. Estevão Filho, T. Hendry, D. Holstein, J. Marsman, N. Mecklenburg, S. Malvar, L. O. Nunes, R. Padilha, M. Sharp, B. Silva, S. Sharma, V. Aski, R. Chandra, RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, 2024. arXiv:2401.08406.
[4] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[5] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[6] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, 2024.
[7] V. Davydova, N. Loukachevitch, E. Tutubalina, Overview of BioNNE Task on Biomedical Nested Named Entity Recognition at BioASQ 2024, in: CLEF Working Notes, 2024.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is All You Need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 6000–6010.
[9] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training, preprint, 2018. URL: https://web.archive.org/web/20240522131718/https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P.
Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730–27744.
[11] OpenAI, GPT-4 Technical Report, 2023. arXiv:2303.08774.
[12] S. Ateia, U. Kruschwitz, Is ChatGPT a biomedical expert?, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 73–90. URL: https://ceur-ws.org/Vol-3497/paper-006.pdf.
[13] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mixtral of Experts, 2024. arXiv:2401.04088.
[14] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, G. E. Hinton, Adaptive mixtures of local experts, Neural Computation 3 (1991) 79–87.
[15] D. Eigen, M. Ranzato, I. Sutskever, Learning factored representations in a deep mixture of experts, 2014. arXiv:1312.4314.
[16] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[17] M. Palatucci, D. Pomerleau, G. E. Hinton, T. M. Mitchell, Zero-shot learning with semantic output codes, Advances in Neural Information Processing Systems 22 (2009).
[18] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3560815. doi:10.1145/3560815.
[19] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs, 2023. arXiv:2305.14314.
[20] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[21] K. Shuster, S. Poff, M. Chen, D. Kiela, J. Weston, Retrieval Augmentation Reduces Hallucination in Conversation, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 3784–3803.
[22] J. I. Tait, An Introduction to Professional Search, Springer International Publishing, Cham, 2014, pp. 1–5. URL: https://doi.org/10.1007/978-3-319-12511-4_1. doi:10.1007/978-3-319-12511-4_1.
[23] S. Verberne, J. He, U. Kruschwitz, G. Wiggers, B. Larsen, T. Russell-Rose, A. P. de Vries, First international workshop on professional search, SIGIR Forum 52 (2018) 153–162.
[24] A. MacFarlane, T. Russell-Rose, F. Shokraneh, Search strategy formulation for systematic reviews: Issues, challenges and opportunities, Intelligent Systems with Applications 15 (2022) 200091. URL: https://www.sciencedirect.com/science/article/pii/S266730532200031X. doi:10.1016/j.iswa.2022.200091.
[25] T. Russell-Rose, P. Gooch, U. Kruschwitz, Interactive query expansion for professional search applications, Business Information Review 38 (2021) 127–137. URL: https://doi.org/10.1177/02663821211034079. doi:10.1177/02663821211034079.
[26] W.-L. Chiang, L. Zheng, Y.
Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, I. Stoica, Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference, 2024. arXiv:2403.04132.
[27] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[28] AI@Meta, Llama 3 Model Card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[29] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, E. Strubell, N. Subramani, M. Wortsman, P. Dasigi, N. Lambert, K. Richardson, L. Zettlemoyer, J. Dodge, K. Lo, L. Soldaini, N. A. Smith, H. Hajishirzi, OLMo: Accelerating the Science of Language Models, 2024. arXiv:2402.00838.
[30] Y. Lu, M. Bartolo, A. Moore, S. Riedel, P. Stenetorp, Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8086–8098.
[31] Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot performance of language models, in: International Conference on Machine Learning, PMLR, 2021, pp. 12697–12706.
[32] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. D. Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, P. Resnik, The Prompt Report: A Systematic Survey of Prompting Techniques, 2024. arXiv:2406.06608.
[33] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, C. Leahy, The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020. URL: https://arxiv.org/abs/2101.00027. arXiv:2101.00027.
[34] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of Hallucination in Natural Language Generation, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[35] P. Nakov, D. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barron-Cedeno, P. Papotti, S. Shaar, G. Da San Martino, et al., Automated Fact-Checking for Assisting Human Fact-Checkers, in: IJCAI, International Joint Conferences on Artificial Intelligence, 2021, pp. 4551–4558.
[36] S. S. Kim, Q. V. Liao, M. Vorvoreanu, S. Ballard, J. W.
Vaughan, “I’m Not Sure, But...”: Examining the Impact of Large Language Models’ Uncertainty Expression on User Reliance and Trust, 2024. arXiv:2405.00623.
[37] M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, K. Lee, Scalable extraction of training data from (production) language models, 2023. arXiv:2311.17035.