<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Benchmark for Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Romano</string-name>
          <email>antonio.romano5@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Riccio</string-name>
          <email>giuseppe.riccio3@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariano Barone</string-name>
          <email>mariano.barone@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Postiglione</string-name>
          <email>marco.postiglione@northwestern.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Moscato</string-name>
          <email>vincenzo.moscato@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <kwd-group>
          <kwd>Healthcare NLP</kwd>
          <kwd>Medical QA Dataset</kwd>
          <kwd>Generative AI</kwd>
          <kwd>Large Language Models</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>60208</institution>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Consorzio Interuniversitario Nazionale per l'Informatica (CINI) - ITEM National Lab, Complesso Universitario Monte S.Angelo</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Northwestern University, Department of Computer Science, McCormick School of Engineering and Applied Science</institution>
          ,
          <addr-line>2233 Tech Dr, Evanston, IL</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Naples Federico II, Department of Electrical Engineering and Information Technology (DIETI)</institution>
          ,
          <addr-line>Via Claudio, 21 - 80125 - Naples</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: IMB-QA, containing 782,644 patient-doctor conversations from 77 medical categories, and IMB-MCQA, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining the original meaning and conversational style, and we compare a variety of LLM architectures on both open-ended and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Since the early days of the Internet, online medical forums have facilitated direct, valuable interactions between patients and healthcare professionals, creating an accessible space for medical advice and support. While these platforms serve as vital resources for medical guidance, they present unique challenges for Natural Language Processing (NLP) systems, particularly in Question Answering (QA) tasks. Unlike traditional medical texts, these conversations are characterized by colloquial language, implicit medical knowledge, and cultural nuances that current QA systems struggle to interpret accurately.</p>
      <p>Existing biomedical QA research has primarily focused on structured, English-language content, leveraging pretrained models like BERT [<xref ref-type="bibr" rid="ref1">1</xref>], RoBERTa [<xref ref-type="bibr" rid="ref2">2</xref>], and BioBERT [<xref ref-type="bibr" rid="ref3">3</xref>]. While these models have shown promising results on standard QA benchmarks [<xref ref-type="bibr" rid="ref4">4</xref>], [<xref ref-type="bibr" rid="ref5">5</xref>], [<xref ref-type="bibr" rid="ref6">6</xref>], they are predominantly trained on formal medical literature and standardized exam questions [<xref ref-type="bibr" rid="ref7">7</xref>]. This creates a significant gap between model capabilities and real-world medical communication needs, particularly in non-English contexts.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. https://github.com/csmariano (M. Barone); http://wpage.unina.it/vmoscato/ (V. Moscato)</p>
      <p>To address these challenges, we introduce two complementary datasets: IMB-QA (Italian Medical Benchmark for Question Answering), a comprehensive collection of 782,644 real-world medical conversations across 77 medical categories from the Italian online forums MedicItalia and Dica33; and IMB-MCQA (Italian Medical Benchmark for Multiple Choice Question Answering), containing 25,862 multiple-choice questions and answers from medical specialty admission exams collected from the simulator CompitoInClasse.org (https://www.compitoinclasse.org/). Both datasets have been carefully curated, with IMB-QA specifically enhanced through LLM-based methodologies to ensure quality and anonymity while preserving the authentic nature of patient-doctor interactions.</p>
      <p>Our work goes beyond data contribution through extensive experimentation with state-of-the-art language models. We conduct a systematic evaluation of various LLM architectures, comparing models of different sizes and training backgrounds, with particular attention to those specialized in biomedical domains. Through this analysis, we explore the two standard approaches to enhancing medical QA performance: Retrieval Augmented Generation (RAG) and in-domain fine-tuning. Our experiments with RAG demonstrate significant improvements in response accuracy and completeness, while our fine-tuning studies reveal the potential of task adaptation even for smaller models. The dual nature of our datasets, spanning both informal forum discussions and formal medical examinations, provides a unique opportunity to assess model performance across different types of medical communication. Our findings challenge conventional assumptions about model size and generalization, suggesting that targeted task adaptation and retrieval-based approaches may be more crucial for medical QA than raw model scale.</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>In Question Answering (QA), models are typically provided with a relevant text from which they must extract answers. However, in real-world applications, manually curating such texts is impractical due to the high cost of obtaining annotated contexts. This challenge has driven the development of Open-Domain QA (OpenQA), where models must autonomously retrieve and understand relevant information to generate accurate responses [<xref ref-type="bibr" rid="ref8">8</xref>]. In the biomedical domain, numerous datasets have been introduced to advance QA, particularly in high-resource languages such as English (as shown in Table 1). However, resources for other linguistic domains, especially Italian, remain scarce, limiting the development and evaluation of multilingual biomedical QA models.</p>
      <p>Open-Domain and MRC Biomedical QA. Several datasets support OpenQA and Machine Reading Comprehension (MRC) in the biomedical field. BiQA [<xref ref-type="bibr" rid="ref9">9</xref>] compiles questions from online forums (e.g., Stack Exchange, Reddit) and links them to PubMed articles, though the accuracy of this linking remains largely unverified. HealthQA [10] consists of manually curated medical questions with answers sourced from patient information websites, yet it lacks a systematic quality assessment. BioRead [23] and its extended version, BioMRC [24], annotate texts using Unified Medical Language System (UMLS) concepts, enhancing knowledge representation but focusing more on structured information extraction than on OpenQA. The COVID-19 pandemic drove the creation of specialized datasets such as EPIC-QA [11] and COVID-QA [12], which compile question-answer pairs from pandemic-related literature; however, their long-term relevance is inherently limited to this specific context. CliCR [13] employs cloze-style questions derived from clinical case reports to assess comprehension and inference abilities, yet its scope is restricted to a narrow set of medical conditions. Although most biomedical QA datasets are available only in English, some efforts have targeted other languages. LiveQA-Med [14] provides a small set of 634 annotated medical question-answer pairs, but its test set (104 questions) is too limited for robust evaluation. MEDQA [15], built from medical board exams in English and Chinese, does not clearly specify the balance between languages or the translation quality. WebMedQA [17], derived from Chinese health consultancy platforms, reflects real-world medical inquiries, though its reliability depends on the moderation of user-generated content.</p>
      <p>Multiple Choice QA. Several datasets focus on multiple-choice QA (MCQA) for biomedical applications. HEAD-QA [19] and MedMCQA [20] assess domain knowledge and reasoning skills but lack coverage for Italian. PubMedQA presents a distinct format where article titles serve as binary-answer questions, though it does not address complex inferential reasoning. While ChiMed [21] and cMedQA [15] provide Chinese-language biomedical MCQA datasets, Italian biomedical QA resources remain virtually nonexistent. QA4-MRE [22] attempted to introduce multilingual medical reading comprehension, yet its dataset was limited in both scale and scope. To address this gap, we introduce a large-scale Italian biomedical QA dataset, consisting of 782,644 question-answer pairs spanning 77 medical categories, alongside an Italian biomedical MCQA dataset with 25,862 multiple-choice questions across 60 categories. Compared to existing datasets, our corpus is significantly larger and more diverse, enhancing both domain-specific knowledge extraction and OpenQA capabilities. Furthermore, we employ advanced post-processing techniques to improve answer accuracy and applicability in medical information retrieval tasks.</p>
      <p>Example from IMB-QA (Table 2). ID: 14571. URL: https://www.medicitalia.it/consulti/psicoterapia/23549-psichiatri-per-fare-psicoterapianecessitano-di-specializzazione.html</p>
      <p>Question. ITA: Psichiatri: per fare psicoterapia necessitano di specializzazione? Gentili dottori, uno psichiatra è automaticamente abilitato a fare lo psicoterapeuta, o deve, come gli psicologi che intendono anche essere psicoterapeuti, fare i 4 anni di specializzazione in psicoterapia? Grazie.</p>
      <p>ENG: Psychiatrists: do they need specialization to practice psychotherapy? Dear doctors, is a psychiatrist automatically allowed to practice psychotherapy, or does he need, like psychologists who also wish to be psychotherapists, to undergo 4 years of specialization in psychotherapy? Thank you.</p>
      <p>Reformulated Answer. ITA: Dopo la specializzazione, uno psichiatra può iscriversi all'albo degli psicoterapeuti senza dover completare un'ulteriore specializzazione. Sebbene alcuni psichiatri scelgano di proseguire con studi supplementari in psicoterapia, tale formazione non è obbligatoria. Gli psicologi, invece, devono frequentare un corso di specializzazione per diventare psicoterapeuti.</p>
      <p>ENG: After specialization, a psychiatrist can register with the list of psychotherapists without needing further specialization. While some psychiatrists choose to pursue additional studies in psychotherapy, this training is not mandatory. Psychologists, however, must attend a specialization course to become psychotherapists.</p>
      <p>Category: ITA: Psicoterapia - ENG: Psychotherapy</p>
      <p>Macro Category: ITA: Salute Mentale - ENG: Mental Health</p>
    </sec>
    <sec id="sec-4">
      <title>3. IMB Dataset</title>
      <p>The IMB dataset consists of two structured subsets: IMB-QA, which focuses on unstructured, patient-driven medical inquiries and professional responses, and IMB-MCQA, which contains structured multiple-choice questions designed for evaluating domain-specific medical knowledge. The IMB-QA dataset captures natural, patient-driven inquiries and professional responses, reflecting real-world medical concerns and interactions (refer to Table 2 for an example).</p>
      <p>In contrast, the IMB-MCQA dataset consists of structured multiple-choice questions derived from medical specialization exam simulators, providing a controlled environment for evaluating domain-specific knowledge (an example is shown in Table 3).</p>
      <sec id="sec-4-2">
        <p>The IMB-QA dataset was constructed by collecting questions and answers from two Italian medical forums: MedicItalia and Dica33. These public platforms facilitate interactions between users and certified healthcare professionals. The selection of these forums was guided by qualitative reliability criteria, including verification of medical credentials and assessment of response quality. The data extraction process was conducted through automated retrieval of publicly available information. To enhance compliance with GDPR requirements, an anonymization procedure was applied to remove Personally Identifiable Information (PII). However, we acknowledge that ensuring complete anonymization is inherently challenging, especially in medical contexts where indirect re-identification risks may persist. Future iterations of the dataset will incorporate additional validation steps to assess and improve the effectiveness of the anonymization process. The dataset covers a broad spectrum of common clinical conditions, supporting its medical representativeness. Each sample consists of the following components: a question formulated by a user, representing a real medical concern and assigned to a specific medical category; an answer provided by a certified healthcare professional, reformulated when necessary to improve clarity and coherence while ensuring the anonymization of personal data; and additional metadata, including the corresponding medical category, the macro-category and, where applicable, the URL of the original source.</p>
        <p>The IMB-MCQA dataset, on the other hand, was constructed by collecting multiple-choice questions from the Italian medical specialization exam simulator CompitoInClasse.org. Each sample consists of the following components: a question related to a specific clinical topic, selected from official simulators that provide access to past examination questions; the multiple-choice answers associated with the question, including one correct answer validated by domain experts; the medical category of the question, identifying the relevant medical field (e.g., physiology, cardiology); and the percentage of correct answers, calculated from the responses of a substantial number of candidates who have used the simulator, with a minimum response threshold to ensure reliability.</p>
        <sec id="sec-4-2-1">
          <title>3.2. Data preprocessing methods</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <p>The IMB-QA dataset was built from Italian medical forums, collecting 782,644 patient questions and certified professional answers across 77 categories (up to July 2024), capturing real-world interactions.</p>
        <p>The IMB-MCQA dataset was compiled from official Italian medical specialization exams through 2024 and includes 25,862 multiple-choice questions across 60 clinical fields, each with 4–5 options. As is typical of unstructured sources, both datasets contained inconsistencies, redundancies, and PII. A multi-stage preprocessing pipeline improved their quality and NLP usability. Summary statistics are reported in Table 4.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.2.1. Preprocessing for IMB-QA</title>
        <p>Data cleaning. Incomplete or truncated questions were removed, doctor signatures and timestamps were stripped, and minor inconsistencies were fixed while preserving meaning.</p>
        <p>Text Normalization, Answer Reformulation, and Data Anonymization. These operations were carried out using Llama3-Med42-8B [25], a Large Language Model (LLM) specialized in the medical domain and adapted for multilingual tasks. The model underwent a prompt engineering phase to enhance the clarity, coherence, and grammatical accuracy of the responses while preserving an adequate level of fidelity to medical information. User-submitted questions were retained in their original form to preserve the natural variability and authenticity of real-world patient inputs. In contrast, doctors' responses were reformulated according to three main criteria: (i) removal of redundancies and colloquial language, (ii) stylistic consistency across responses, and (iii) improved readability for more effective processing by NLP models. To address anonymization, we utilized Italian_NER_XXL [26], a NER model specifically trained on Italian. This model successfully identified PII, such as names of patients and doctors, cities, online resources, email addresses, healthcare facilities, and other identifiers that could enable re-identification. The identified PII underwent an anonymization procedure using the same LLM employed for reformulation, which preserved sentence semantics while substituting terms with generic, medical context-appropriate alternatives. The effectiveness of anonymization was evaluated by calculating the percentage of PII (detected using the same NER model as in the anonymization phase) in the initial, reformulated, and anonymized responses on a subset of approximately 2,163 responses selected equally from all medical categories in the dataset. Initially, 27% of answers contained PII; reformulation reduced this to 7%, and ultimately, anonymization decreased the presence of PII to just 1%.</p>
        <p>Data Categorization. To group questions into broader semantic fields, unsupervised topic modeling via BERTopic [27] was applied. Sentence embeddings were generated with "paraphrase-multilingual-MiniLM-L12-v2" [28], reduced via UMAP [29], and clustered using HDBSCAN [30]. This enabled flexible, interpretable macro-categorization without enforcing rigid class definitions. Final groupings are reported in Table 5.</p>
        <sec id="sec-4-4-2">
          <title>3.2.2. Preprocessing for IMB-MCQA</title>
          <p>As this dataset was already in a clean, structured exam format, preprocessing mainly involved organizing entries and ensuring consistent formatting. No major cleaning or reformulation was necessary. The workflow is summarized in Figure 1.</p>
        </sec>
        <sec id="sec-4-4-1">
          <title>3.3. Data Analysis</title>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>3.3.1. Diversity of Questions</title>
        <p>Clinical medicine covers a broad range of topics, reflected in the question types within the IMB dataset. To assess this variety, a qualitative analysis was conducted on a random sample of 102 questions from IMB-QA and IMB-MCQA. Given the complexity of accurately classifying questions as fact-based or case-based through automated methods, manual categorization was chosen. Fact-based questions focus on specific medical knowledge and clear reasoning, such as "Which condition is linked to persistent fatigue?". Case-based questions, instead, present a patient's symptoms or medical background, requiring multi-step reasoning for diagnosis, treatment decisions, or prognosis, such as assessing a patient with chest pain.</p>
        <p>Figure 2: Percentage of questions with above-average difficulty (%) by macro-category in IMB-QA. The score refers to the percentage of questions in each category that were classified as above-average in difficulty, based on our difficulty index.</p>
      </sec>
      <sec id="sec-4-6">
        <p>The analysis indicates that IMB-QA is predominantly composed of case-based questions, where patients describe symptoms and seek medical guidance, requiring models to perform complex reasoning. Although IMB-MCQA mainly consists of fact-based questions, as it evaluates medical knowledge for specialization exams, it also includes a considerable number of case-based inquiries. This dual function highlights the datasets' role in assessing both factual knowledge and clinical decision-making, with IMB-QA emphasizing patient narratives and IMB-MCQA blending factual recall with clinical reasoning.</p>
      </sec>
      <sec id="sec-4-7">
        <title>3.3.2. Need for Domain-Specific Expertise</title>
        <p>To evaluate the datasets' complexity, we assessed question difficulty. In IMB-QA, a sample of 2,500 questions was analyzed using a difficulty index based on length, terminology, and syntax. 39.24% of questions were above-average in difficulty, with Neurology exceeding 70%, indicating high specialization demands (Figure 2).</p>
        <p>In IMB-MCQA, difficulty was estimated from participant accuracy. Categories like "Thermal Medicine" (80.12%), "Ophthalmology" (72.86%), "Neurosurgery" (71.30%), and "Nuclear Medicine" (66.95%) showed high complexity (Figure 3). These results confirm that both datasets require advanced clinical knowledge, making them valuable for training models in specialized medical reasoning.</p>
      </sec>
      <sec id="sec-4-8">
        <title>3.3.3. Diversity of Categories</title>
        <p>The IMB dataset shows uneven category distribution, affecting model performance across specialties. IMB-QA (Figure 4) overrepresents areas like "Gastroenterology", "Cardiology", and "Urology", while fields like "Sleep Medicine" and "Pediatric Surgery" are underrepresented. This may lead to imbalanced model capabilities. IMB-MCQA (Figure 5) shows a more uniform distribution, with most categories having ∼350 questions, except "General Medicine" (∼5,000), reducing but not eliminating coverage gaps in niche fields.</p>
      </sec>
      <sec id="sec-4-9">
        <title>3.3.4. Presence of Information Noise and Ambiguity in Responses</title>
        <p>Challenges in the IMB dataset include noise and ambiguity. In IMB-QA, informal forum responses often contain contextual or generic advice, sometimes prioritizing in-person consultation over definitive answers. These traits, while realistic, introduce variability. Preprocessing helped filter irrelevant elements and standardize responses. In IMB-MCQA, ambiguity stems from distractors designed to assess reasoning, with some questions allowing multiple valid interpretations. Such complexity enhances the dataset's value in training models to manage uncertainty and emulate clinical decision-making.</p>
      </sec>
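The difficulty index is described as combining question length, terminology, and syntax, but its exact weighting is not given in the text. The following is therefore only a hypothetical sketch of such an index: the term list, weights, and example questions are invented.

```python
import statistics

# Invented term list and weights: the paper's exact index is not specified.
MEDICAL_TERMS = {"tachicardia", "dispnea", "ipertensione", "anamnesi"}

def difficulty(question: str) -> float:
    """Toy index combining length, medical terminology, and a crude syntax proxy."""
    words = [w.strip(",?.") for w in question.lower().split()]
    term_hits = sum(w in MEDICAL_TERMS for w in words)
    clauses = question.count(",") + 1  # comma count as a proxy for syntactic complexity
    return 0.1 * len(words) + 1.0 * term_hits + 0.5 * clauses

questions = [
    "Ho mal di testa, cosa posso fare?",
    "Soffro di tachicardia e dispnea, devo preoccuparmi?",
    "Quanto dura un raffreddore?",
]
scores = [difficulty(q) for q in questions]
mean_score = statistics.mean(scores)
above_average = 100.0 * sum(s > mean_score for s in scores) / len(scores)
print(round(above_average, 2))  # 33.33
```

With the per-question scores in hand, the reported statistic (e.g., 39.24% of IMB-QA questions above average) is simply the share of scores exceeding the sample mean.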
    </sec>
    <sec id="sec-5">
      <title>4. Applications</title>
      <sec id="sec-5-2">
        <title>4.1. Benchmarking Large Language Models</title>
        <p>Evaluating LLMs on domain-specific datasets is essential to measure their suitability for fields like medicine, where precise understanding is required [31]. Despite advancements in general-purpose knowledge, performance in non-English clinical contexts remains limited [32]. IMB-QA and IMB-MCQA enable benchmarking in Italian for both open-ended and multiple-choice medical QA, capturing language-specific features, technical terminology, and clinical nuances.</p>
        <p>We evaluate open-ended QA using BERTScore [33] with the multilingual model bert-base-multilingual-cased, chosen for its cross-lingual semantic similarity capabilities and its widespread adoption in multilingual NLP benchmarks. For MCQA tasks, we report standard accuracy. This dual evaluation highlights LLM strengths and limitations in Italian clinical applications.</p>
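The MCQA metric reduces to exact match over the selected options; a minimal sketch (function and variable names are ours, and the example answers are invented):

```python
def mcqa_accuracy(predictions, gold_answers):
    """Exact-match accuracy over multiple-choice selections."""
    if len(predictions) != len(gold_answers):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

print(mcqa_accuracy(["B", "C", "A", "D"], ["B", "C", "B", "D"]))  # 0.75
```

BERTScore, by contrast, requires the `bert-score` package and a reference model, so it is not reproduced here.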
      </sec>
      <sec id="sec-5-3">
        <title>4.2. Medical Question Answering</title>
        <p>Medical QA demands models that handle informal, complex queries without hallucinating [34, 35]. We apply Retrieval-Augmented Generation (RAG) using a separate knowledge base of 100k anonymized IMB-QA answers, explicitly excluding evaluation samples to avoid data leakage. Relevant contexts are retrieved via dense embeddings generated with all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and indexed using FAISS [36]. We retrieve the top-5 most similar passages, which are then prepended to the query. This ensures factual grounding while maintaining separation between retrieved context and target answers. Although we did not perform a separate retriever evaluation, the overall gain in BERTScore (Table 7) confirms the added value of retrieval. The process is formalized as:</p>
        <p>a = LLM(q, R(q, D)) (1)</p>
        <p>where q is the query, D the dataset, R the retrieval function, and a the generated answer. Table 7 shows RAG improves BERTScore Precision across all categories.</p>
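The retrieval step, R(q, D) followed by prepending the retrieved context to the query, can be illustrated without the actual FAISS index or sentence-transformer model. In the sketch below, hand-written 2-dimensional vectors stand in for dense embeddings, and all passages are invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, index, k=5):
    """Return the k passages most similar to the query embedding.
    Stand-in for the FAISS index over all-MiniLM-L6-v2 embeddings."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:k]]

def build_prompt(query, passages):
    # Retrieved context is prepended to the query before calling the LLM.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hand-written 2-d vectors stand in for dense sentence embeddings.
index = [
    {"text": "Passage about cardiology", "vec": [1.0, 0.0]},
    {"text": "Passage about dermatology", "vec": [0.0, 1.0]},
    {"text": "Passage about hypertension", "vec": [0.9, 0.1]},
]
top = retrieve([1.0, 0.05], index, k=2)
print(top)  # ['Passage about cardiology', 'Passage about hypertension']
```

In the actual pipeline, k=5 and the prompt is passed to the evaluated LLM; the toy cosine search here plays the role of the FAISS similarity lookup.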
      </sec>
      <sec id="sec-5-4">
        <title>4.3. Fine-tuning</title>
        <p>Fine-tuning improves domain alignment for LLMs, especially in non-English medical contexts [<xref ref-type="bibr" rid="ref10">37, 38</xref>]. Using IMB-QA, we fine-tune Small Language Models (SLMs) such as Llama-3.2-1B, Gemma-2-2b-it, and Qwen2.5-1.5B [<xref ref-type="bibr" rid="ref11">39</xref>], leveraging [CLS]/[SEP] token strategies, cross-entropy loss, and Curriculum Learning [<xref ref-type="bibr" rid="ref12">40</xref>] via the Unsloth [<xref ref-type="bibr" rid="ref13">41</xref>] library. This approach aims to enhance output accuracy and reduce hallucinations while ensuring efficient deployment in clinical environments. Although formal hallucination metrics are not reported, the results in Table 8 show that fine-tuning on IMB-QA leads to modest improvements across several metrics, particularly in BERTScore and BLEU. Gains are model-dependent and not uniform across all scores: for instance, METEOR slightly decreases in some cases. Nonetheless, the overall trend supports the effectiveness of task-specific adaptation in improving answer quality in Italian medical QA.</p>
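Curriculum Learning orders training examples from easy to hard before scheduling epochs. The sketch below illustrates only that ordering step; the actual pipeline uses the Unsloth library and LLM tokenizers, and question length here is an invented difficulty proxy:

```python
def curriculum_order(samples, difficulty_fn):
    """Order training samples from easy to hard before scheduling epochs."""
    return sorted(samples, key=difficulty_fn)

# Question length is an invented difficulty proxy for illustration.
batch = [
    "Paziente con dolore toracico irradiato al braccio sinistro da due ore",
    "Mal di testa",
    "Febbre e tosse da tre giorni",
]
ordered = curriculum_order(batch, difficulty_fn=len)
print(ordered[0])  # Mal di testa
```

Any scalar difficulty estimate (such as the index from Section 3.3.2) can be plugged in as `difficulty_fn`.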
      </sec>
      <sec id="sec-5-5">
        <title>5. Experiments</title>
        <sec id="sec-5-5-1">
          <title>5.1. Experimental Setup</title>
          <p>Experiments were conducted on Google Colab Pro using an NVIDIA T4 GPU and an Intel Xeon CPU. Due to hardware constraints, the evaluation focused on the most complex categories, as defined in Section 3.3.2. For IMB-QA, ∼2,000 instances were sampled per category, except for the "Neurology" category, which includes only 998 instances. In the case of IMB-MCQA, the full set of instances was used.</p>
        </sec>
        <sec id="sec-5-5-2">
          <title>5.2. MCQA Results</title>
          <p>IMB-MCQA offers a robust benchmark for clinical QA in multiple-choice format, evaluated using accuracy. As shown in Figure 6, models with more than 8B parameters achieve nearly 85% accuracy, outperforming smaller models, which struggle with domain-specific reasoning.</p>
          <p>These trends align with prior analyses of category difficulty, where questions involving underrepresented or cognitively complex fields proved more challenging even for advanced LLMs.</p>
        </sec>
        <sec id="sec-5-5-3">
          <title>5.3. Medical QA Results</title>
          <p>IMB-QA allows assessment of open-ended medical QA, where semantic accuracy is paramount. In Figure 7, gemma-2-9b-it outperforms larger models, likely due to its multilingual training. Despite its smaller size, it achieves competitive BERTScore Precision (up to 0.638), suggesting high semantic alignment. This metric is more informative than fluency-based ones in clinical settings, where accurate, relevant answers are crucial.</p>
        </sec>
      </sec>
      <sec id="sec-5-6">
        <title>5.4. Fine-tuning SLMs Results</title>
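<p>The 80/20 train/eval protocol can be sketched as follows (a minimal, self-contained illustration: the record fields and the fixed seed are hypothetical, and the fine-tuning itself relied on the Unsloth library):</p>

```python
import random

def train_eval_split(records, eval_fraction=0.2, seed=42):
    """Deterministically shuffle the records and split them into train/eval partitions."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = records[:]              # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]   # (train, eval)

# Example: splitting 2,000 sampled QA instances (field names are illustrative)
instances = [{"id": i, "question": f"Q{i}", "answer": f"A{i}"} for i in range(2000)]
train_set, eval_set = train_eval_split(instances)
print(len(train_set), len(eval_set))  # → 1600 400
```

<p>Holding the seed fixed keeps the evaluation partition identical across runs, so per-metric comparisons between base and fine-tuned models are computed on the same instances.</p>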
        <p>[Figure: BERTScore Precision by medical category for the ten evaluated models. Per-category instance counts range from n=998 ("Neurology") to n=2001 ("Gynecology and female health"); scores range from 0.583 ("Cardiology, circulatory system and hematology") to 0.653 ("Neurology"), with "Neurology" scoring highest across models. Color scale: 0.2-0.8.]</p>
        <p>
Some models showed performance drops in specific scores such as METEOR. This confirms that task adaptation improves answer quality and contextual understanding, even for compact models, making them well-suited for clinical applications.</p>
      </sec>
      <sec id="sec-conclusion">
        <title>6. Conclusion &amp; Future Work</title>
        <p>In this work, we introduced IMB, the first Italian dataset for medical question-answering, which includes both open-ended (QA) and multiple-choice (MCQA) questions. The dataset, sourced from medical forums and exam simulators, provides a valuable resource for the development of advanced NLP models. Our qualitative and quantitative analysis highlighted a diverse range of medical specialties, while also revealing challenges related to question difficulty and clinical complexity. Initial experiments with state-of-the-art language models demonstrated that these models struggle with clinically complex Italian questions but perform relatively well on multiple-choice questions. Future work will focus on expanding the dataset by incorporating additional medical specialties and languages (such as English), improving category balancing, and implementing advanced filtering techniques to reduce informational noise. Furthermore, we will explore strategies for adapting language models to improve their ability to understand and reason effectively about medical content.</p>
        <p>Limitations. IMB has several limitations, including an imbalance in specialty representation. Fields such as "Gastroenterology" and "Cardiology" are overrepresented, while others, such as "Sleep Medicine" and "Pediatric Surgery", have limited coverage. This imbalance may affect model generalization. We will address this issue through data balancing techniques, such as oversampling and weighted training strategies. Another limitation arises from informational noise, as the questions were automatically collected from public sources, which may include irrelevant or ambiguous details. We plan to tackle this challenge by employing semantic filtering and human verification methods. Additionally, ambiguity in responses, particularly in the IMB-MCQA dataset, poses a challenge, which we aim to overcome through disambiguation techniques and more precise annotation strategies.</p>
        <p>Ethical and Legal Considerations. Our dataset has been developed using content sourced from publicly accessible Italian medical sites (MedicItalia, Dica33) as well as a medical exam simulator (CompitoInClasse.org). The dataset is intended exclusively for academic research with non-commercial objectives, adhering to legal guidelines regarding GDPR compliance, data anonymization, and research-related copyright exemptions as outlined in Italian and EU legislation. To mitigate any legal and ethical challenges, and based on consultations with legal experts, we implemented several measures: (1) Anonymization: all identifying details (e.g. names, contact details, emails) were removed or altered with the help of automated scripts and LLM-supported redaction, conforming to GDPR's tenets of data minimization and protection. (2) Textual Transformation: while we provide links to the original source of each data sample, the raw questions and answers were linguistically restructured and polished, involving grammatical adjustments, simplification, and content refinement with the aid of LLMs and manual oversight. (3) Scientific Scope: this data serves strictly educational, illustrative, and scientific purposes as permitted under Article 89 of the GDPR and Article 70 of the Italian Copyright Law, which allows non-commercial research data usage under specified conditions. For this reason, the dataset is distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0) license. This license strictly restricts usage to non-commercial research, prohibits redistribution of altered versions, and mandates proper author attribution.</p>
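<p>The anonymization measure can be illustrated with a small rule-based sketch; the patterns, placeholder tags, and sample text below are illustrative assumptions rather than the actual redaction scripts, which also relied on LLM-supported review:</p>

```python
import re

# Illustrative redaction rules (hypothetical, not the scripts actually used):
# each pattern is replaced by a neutral placeholder tag.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),           # e-mail addresses
    (re.compile(r"\+?\d[\d ./-]{7,}\d"), "[PHONE]"),               # phone-like digit runs
    (re.compile(r"\b(?:Dott\.ssa|Dott\.|Dr\.|Sig\.ra|Sig\.)\s+[A-Z]\w+"),
     "[NAME]"),                                                    # Italian honorific + surname
]

def redact(text):
    """Apply each rule in turn, replacing matches with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sample = "Salve, sono il Dott. Rossi, contattatemi a mario.rossi@example.it o al +39 081 1234567."
print(redact(sample))  # → Salve, sono il [NAME], contattatemi a [EMAIL] o al [PHONE].
```

<p>Fixed rules of this kind inevitably miss free-text identifiers, which is why the pipeline described above complements them with LLM-supported redaction and manual oversight.</p>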
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
          <p>This work was conducted with the financial support of (1) the PNRR MUR project PE0000013-FAIR and (2) the Italian Ministry of Economic Development, via the ICARUS (Intelligent Contract Automation for Rethinking User Services) project (CUP: B69J23000270005).</p>
          <p>[10] M. Zhu, A. Ahuja, W. Wei, C. K. Reddy, A hierarchical attention retrieval model for healthcare question answering, in: The World Wide Web Conference, WWW '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2472–2482. URL: https://doi.org/10.1145/3308558.3313699. doi:10.1145/3308558.3313699.</p>
          <p>[11] M. A. Weinzierl, S. M. Harabagiu, The University of Texas at Dallas HLTRI's participation in EPIC-QA: searching for entailed questions revealing novel answer nuggets, CoRR abs/2112.13946 (2021). URL: https://arxiv.org/abs/2112.13946. arXiv:2112.13946.</p>
          <p>[12] T. Möller, A. Reina, R. Jayakumar, M. Pietsch, COVID-QA: A question answering dataset for COVID-19, in: ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID), ACL, Online, 2020. URL: https://openreview.net/forum?id=JENSKEEzsoU.</p>
          <p>[13] S. Šuster, W. Daelemans, CliCR: a dataset of clinical case reports for machine reading comprehension, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1551–1563. URL: https://aclanthology.org/N18-1140/. doi:10.18653/v1/N18-1140.</p>
          <p>[14] A. B. Abacha, E. Agichtein, Y. Pinter, D. Demner-Fushman, Overview of the medical question answering task at TREC 2017 LiveQA, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, volume 500-324 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2017. URL: https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf.</p>
          <p>[15] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, P. Szolovits, What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020. URL: https://arxiv.org/abs/2009.13081. arXiv:2009.13081.</p>
          <p>[16] A. Pampari, P. Raghavan, J. J. Liang, J. Peng, emrQA: A large corpus for question answering on electronic medical records, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Association for Computational Linguistics, 2018, pp. 2357–2368. URL: https://doi.org/10.18653/v1/d18-1258. doi:10.18653/V1/D18-1258.</p>
          <p>[17] J. He, M. Fu, M. Tu, Applying deep matching networks to Chinese medical question answering: a study and a dataset, BMC Medical Informatics Decis. Mak. 19-S (2019) 91–100. URL: https://doi.org/10.1186/s12911-019-0761-8. doi:10.1186/S12911-019-0761-8.</p>
          <p>[18] A. Nentidis, A. Krithara, K. Bougiatiotis, M. Krallinger, C. R. Penagos, M. Villegas, G. Paliouras, Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings, volume 12260 of Lecture Notes in Computer Science, Springer, 2020, pp. 194–214. URL: https://doi.org/10.1007/978-3-030-58219-7_16. doi:10.1007/978-3-030-58219-7_16.</p>
          <p>[19] D. Vilares, C. Gómez-Rodríguez, HEAD-QA: A healthcare dataset for complex reasoning, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 960–966. URL: https://aclanthology.org/P19-1092/. doi:10.18653/v1/P19-1092.</p>
          <p>[20] A. Pal, L. K. Umapathi, M. Sankarasubbu, MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering, in: G. Flores, G. H. Chen, T. J. Pollard, J. C. Ho, T. Naumann (Eds.), Conference on Health, Inference, and Learning, CHIL 2022, 7-8 April 2022, Virtual Event, volume 174 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 248–260. URL: https://proceedings.mlr.press/v174/pal22a.html.</p>
          <p>[21] Y. Tian, W. Ma, F. Xia, Y. Song, ChiMed: A Chinese medical corpus for question answering, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 250–260. URL: https://aclanthology.org/W19-5027/. doi:10.18653/v1/W19-5027.</p>
          <p>[22] A. Peñas, E. H. Hovy, P. Forner, Á. Rodrigo, R. F. E. Sutcliffe, R. Morante, QA4MRE 2011-2013: Overview of question answering for machine reading evaluation, in: P. Forner, H. Müller, R. Paredes, P. Rosso, B. Stein (Eds.), Information Access Evaluation. Multilinguality, Multimodality, and Visualization - 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013, Proceedings, volume 8138 of Lecture Notes in Computer Science, Springer, 2013, pp. 303–320. URL: https://doi.org/10.1007/978-3-642-40802-1_29. doi:10.1007/978-3-642-40802-1_29.</p>
          <p>[23] D. Pappas, I. Androutsopoulos, H. Papageorgiou, BioRead: A new dataset for biomedical reading comprehension, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA), 2018. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/795.html.</p>
          <p>[24] D. Pappas, P. Stavropoulos, I. Androutsopoulos, R. McDonald, BioMRC: A dataset for biomedical machine reading comprehension, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Association for Computational Linguistics, Online, 2020, pp. 140–149. URL: https://aclanthology.org/2020.bionlp-1.15/. doi:10.18653/v1/2020.bionlp-1.15.</p>
          <p>[25] C. Christophe, P. K. Kanithi, T. Raha, S. Khan, M. A. Pimentel, Med42-v2: A suite of clinical LLMs, CoRR abs/2408.06142 (2024). URL: https://doi.org/10.48550/arXiv.2408.06142. doi:10.48550/ARXIV.2408.06142. arXiv:2408.06142.</p>
          <p>[26] DeepMount00, Italian_NER_XXL, https://huggingface.co/DeepMount00/Italian_NER_XXL, 2024.</p>
          <p>[27] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, CoRR abs/2203.05794 (2022). URL: https://doi.org/10.48550/arXiv.2203.05794. doi:10.48550/ARXIV.2203.05794. arXiv:2203.05794.</p>
          <p>[28] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/V1/D19-1410.</p>
          <p>[29] L. McInnes, J. Healy, UMAP: uniform manifold approximation and projection for dimension reduction, CoRR abs/1802.03426 (2018). URL: http://arxiv.org/abs/1802.03426. arXiv:1802.03426.</p>
          <p>[30] M. F. Rahman, W. Liu, S. B. Suhaim, S. Thirumuruganathan, N. Zhang, G. Das, HDBSCAN: density based clustering over location based services, CoRR abs/1602.03730 (2016). URL: http://arxiv.org/abs/1602.03730. arXiv:1602.03730.</p>
          <p>[31] J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, M. L. Li, Benchmarking large language models on CMExam - a comprehensive Chinese medical exam dataset, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/a48ad12d588c597f4725a8b84af647b5-Abstract-Datasets_and_Benchmarks.html.</p>
          <p>[32] Y. Jin, M. Chandra, G. Verma, Y. Hu, M. D. Choudhury, S. Kumar, Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries, in: T. Chua, C. Ngo, R. Kumar, H. W. Lauw, R. K. Lee (Eds.), Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, ACM, 2024, pp. 2627–2638. URL: https://doi.org/10.1145/3589334.3645643. doi:10.1145/3589334.3645643.</p>
          <p>[33] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.</p>
          <p>[34] S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, Z. Ding, Chinese medical question answer matching using end-to-end character-level multi-scale CNNs, Applied Sciences 7 (2017) 767.</p>
          <p>[35] N. Yagnik, J. Jhaveri, V. Sharma, G. Pila, A. Ben, J. Shang, MedLM: Exploring language models for medical question answering systems, CoRR abs/2401.11389 (2024). URL: https://doi.org/10.48550/arXiv.2401.11389. doi:10.48550/ARXIV.2401.11389. arXiv:2401.11389.</p>
          <p>[36] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, The Faiss library, arXiv preprint arXiv:2401.08281 (2024). arXiv:2401.08281.</p>
          <p>[37] H. Tran, Z. Yang, Z. Yao, H. Yu, BioInstruct: instruction tuning of large language models for biomedical natural language processing, J. Am. Medical Informatics Assoc. 31 (2024) 1821–1832. URL: https://doi.org/10.1093/jamia/ocae122. doi:10.1093/JAMIA/OCAE122.</p>
<p>Declaration on Generative AI. During the preparation of this work, the author(s) used ChatGPT (OpenAI) and DeepL Write / DeepL Translate for grammar and spelling checking. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/. doi:
          <volume>10</volume>
          .18653/ v1/
          <fpage>N19</fpage>
          - 1423.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/
          <year>1907</year>
          .11692 (
          <year>2019</year>
          )
          <article-title>-</article-title>
          . URL: http: //arxiv.org/abs/
          <year>1907</year>
          .11692. arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinform</source>
          .
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . URL: https://doi.org/10.1093/bioinformatics/btz682. doi:
          <volume>10</volume>
          .1093/BIOINFORMATICS/BTZ682.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          , P. Liang, SQuAD:
          <volume>100</volume>
          ,000+
          <article-title>questions for machine comprehension of text</article-title>
          , in: J.
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Duh</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          Carreras (Eds.),
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Austin, Texas,
          <year>2016</year>
          , pp.
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          . URL: https://aclanthology.org/ D16-1264/. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D16</fpage>
          - 1264.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. G. Carbonell, R. Salakhutdinov,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          , in: H.
          <string-name>
            <surname>M. Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>d'Alché-</article-title>
          <string-name>
            <surname>Buc</surname>
            ,
            <given-names>E. B.</given-names>
          </string-name>
          <string-name>
            <surname>Fox</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          <year>2019</year>
          ,
          <article-title>NeurIPS 2019</article-title>
          , December 8-
          <issue>14</issue>
          ,
          <year>2019</year>
          , Vancouver, BC, Canada, NeurIPS, Vancouver, BC, Canada,
          <year>2019</year>
          , pp.
          <fpage>5754</fpage>
          -
          <lpage>5764</lpage>
          . URL: https://proceedings.neurips.cc/paper/2019/hash/ dc6a7e655d7e5840e66733e9ee67cc69-Abstract. html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Redfield</surname>
          </string-name>
          , M. Collins,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelcey</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <article-title>Natural questions: A benchmark for question answering research, Transactions of the Association for Computational Linguistics 7 (</article-title>
          <year>2019</year>
          )
          <fpage>452</fpage>
          -
          <lpage>466</lpage>
          . URL: https://aclanthology.org/Q19-1026/. doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00276</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , G. Balikas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , I. Partalas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zschunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almirantis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Baskiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gallinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Artières</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heino</surname>
          </string-name>
          , É. Gaussier,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrio-Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , I. Androutsopoulos,
          <string-name>
            <surname>G. Paliouras,</surname>
          </string-name>
          <article-title>An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <volume>138</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>138</lpage>
          :
          <fpage>28</fpage>
          . URL: https:// doi.org/10.1186/s12859-015-0564-6. doi:
          <volume>10</volume>
          .1186/ S12859- 015- 0564- 6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Retrieve what you need: A mutual learning framework for open-domain question answering</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>247</fpage>
          -
          <lpage>263</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00646. doi:10.1162/TACL_A_00646.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          , Generat… URL: https://doi.org/10.1093/jamia/ocae122. doi:10.1093/JAMIA/OCAE122.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Dorfner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Busch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Makowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Truhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleesiek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sushil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lammert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Bressem</surname>
          </string-name>
          ,
          <article-title>Biomedical large languages models seem not to be superior to generalist models on unseen medical data</article-title>
          ,
          <source>CoRR abs/2408.13833</source>
          (
          <year>2024</year>
          )
          . URL: https://doi.org/10.48550/arXiv.2408.13833. doi:10.48550/ARXIV.2408.13833. arXiv:2408.13833.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aponte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kunapuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barrow</surname>
          </string-name>
          , et al.,
          <article-title>A survey of small language models</article-title>
          ,
          <source>arXiv preprint arXiv:2410.20011</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          in:
          <source>Proceedings of the 26th Annual International Conference on Machine Learning</source>
          , ICML '09, Association for Computing Machinery, New York, NY, USA,
          <year>2009</year>
          , p.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          . URL: https://doi.org/10.1145/1553374.1553380. doi:10.1145/1553374.1553380.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Han</surname>
          </string-name>
          , Unsloth team, Unsloth,
          <year>2023</year>
          . URL: http://github.com/unslothai/unsloth.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>