<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Benchmark for Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Romano</string-name>
          <email>antonio.romano5@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Riccio</string-name>
          <email>giuseppe.riccio3@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariano Barone</string-name>
          <email>mariano.barone@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Postiglione</string-name>
          <email>marco.postiglione@northwestern.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Moscato</string-name>
          <email>vincenzo.moscato@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <kwd-group>
          <kwd>Healthcare NLP</kwd>
          <kwd>Medical QA Dataset</kwd>
          <kwd>Generative AI</kwd>
          <kwd>Large Language Models</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>60208</institution>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Consorzio Interuniversitario Nazionale per l'Informatica (CINI) - ITEM National Lab, Complesso Universitario Monte S.Angelo</institution>
          ,
          <addr-line>Naples</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Northwestern University, Department of Computer Science, McCormick School of Engineering and Applied Science</institution>
          ,
          <addr-line>2233 Tech Dr, Evanston, IL</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Naples Federico II, Department of Electrical Engineering and Information Technology (DIETI)</institution>
          ,
          <addr-line>Via Claudio, 21 - 80125 - Naples</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Online medical forums have long served as vital platforms where patients seek professional healthcare advice, generating vast amounts of valuable knowledge. However, the informal nature and linguistic complexity of forum interactions pose significant challenges for automated question answering systems, especially when dealing with non-English languages. We present two comprehensive Italian medical benchmarks: IMB-QA, containing 782,644 patient-doctor conversations from 77 medical categories, and IMB-MCQA, comprising 25,862 multiple-choice questions from medical specialty examinations. We demonstrate how Large Language Models (LLMs) can be leveraged to improve the clarity and consistency of medical forum data while retaining the original meaning and conversational style, and we compare a variety of LLM architectures on both open-ended and multiple-choice question answering tasks. Our experiments with Retrieval Augmented Generation (RAG) and domain-specific fine-tuning reveal that specialized adaptation strategies can outperform larger, general-purpose models in medical question answering tasks. These findings suggest that effective medical AI systems may benefit more from domain expertise and efficient information retrieval than from increased model scale. We release both datasets and evaluation frameworks in our GitHub repository to support further research on multilingual medical question answering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Since the early days of the Internet, online medical forums have facilitated direct, valuable interactions between patients and healthcare professionals, creating an accessible space for medical advice and support. While these platforms serve as vital resources for medical guidance, they present unique challenges for Natural Language Processing (NLP) systems, particularly in Question Answering (QA) tasks. Unlike traditional medical texts, these conversations are characterized by colloquial language, implicit medical knowledge, and cultural nuances that current QA systems struggle to interpret accurately.</p>
      <p>Existing biomedical QA research has primarily focused on structured, English-language content, leveraging pretrained models like BERT [<xref ref-type="bibr" rid="ref1">1</xref>], RoBERTa [<xref ref-type="bibr" rid="ref2">2</xref>], and BioBERT [<xref ref-type="bibr" rid="ref3">3</xref>]. While these models have shown promising results on standard QA benchmarks [<xref ref-type="bibr" rid="ref4">4</xref>], [<xref ref-type="bibr" rid="ref5">5</xref>], [<xref ref-type="bibr" rid="ref6">6</xref>], they are predominantly trained on formal medical literature and standardized exam questions [<xref ref-type="bibr" rid="ref7">7</xref>]. This creates a significant gap between model capabilities and real-world medical communication needs, particularly in non-English contexts.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. https://github.com/csmariano (M. Barone); http://wpage.unina.it/vmoscato/ (V. Moscato)</p>
      <p>To address these challenges, we introduce two complementary datasets: IMB-QA (Italian Medical Benchmark for Question Answering), a comprehensive collection of 782,644 real-world medical conversations across 77 medical categories from the Italian online forums MedicItalia and Dica33; and IMB-MCQA (Italian Medical Benchmark for Multiple Choice Question Answering), containing 25,862 multiple-choice questions and answers from medical specialty admission exams collected from the simulator CompitoInClasse.org (https://www.compitoinclasse.org/). Both datasets have been carefully curated, with IMB-QA specifically enhanced through LLM-based methodologies to ensure quality and anonymity while preserving the authentic nature of patient-doctor interactions.</p>
      <p>Our work goes beyond data contribution through extensive experimentation with state-of-the-art language models. We conduct a systematic evaluation of various LLM architectures, comparing models of different sizes and training backgrounds, with particular attention to those specialized in biomedical domains. Through this analysis, we explore the two standard approaches to enhancing medical QA performance: Retrieval Augmented Generation (RAG) and in-domain fine-tuning. Our experiments with RAG demonstrate significant improvements in response accuracy and completeness, while our fine-tuning studies reveal the potential of task adaptation even for smaller models. The dual nature of our datasets, spanning both informal forum discussions and formal medical examinations, provides a unique opportunity to assess model performance across different types of medical communication. Our findings challenge conventional assumptions about model size and generalization, suggesting that targeted task adaptation and retrieval-based approaches may be more crucial for medical QA than raw model scale.</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>In Question Answering (QA), models are typically provided with a relevant text from which they must extract answers. However, in real-world applications, manually curating such texts is impractical due to the high cost of obtaining annotated contexts. This challenge has driven the development of Open-Domain QA (OpenQA), where models must autonomously retrieve and understand relevant information to generate accurate responses [<xref ref-type="bibr" rid="ref8">8</xref>]. In the biomedical domain, numerous datasets have been introduced to advance QA, particularly in high-resource languages such as English (as shown in Table 1). However, resources for other linguistic domains, especially Italian, remain scarce, limiting the development and evaluation of multilingual biomedical QA models.</p>
      <p>Open-Domain and MRC Biomedical QA. Several datasets support OpenQA and Machine Reading Comprehension (MRC) in the biomedical field. BiQA [<xref ref-type="bibr" rid="ref9">9</xref>] compiles questions from online forums (e.g., Stack Exchange, Reddit) and links them to PubMed articles, though the accuracy of this linking remains largely unverified. HealthQA [10] consists of manually curated medical questions with answers sourced from patient information websites, yet it lacks a systematic quality assessment. BioRead [23] and its extended version, BioMRC [24], annotate texts using Unified Medical Language System (UMLS) concepts, enhancing knowledge representation but focusing more on structured information extraction than on OpenQA. The COVID-19 pandemic drove the creation of specialized datasets such as EPIC-QA [11] and COVID-QA [12], which compile question-answer pairs from pandemic-related literature; however, their long-term relevance is inherently limited to this specific context. CliCR [13] employs cloze-style questions derived from clinical case reports to assess comprehension and inference abilities, yet its scope is restricted to a narrow set of medical conditions. Although most biomedical QA datasets are available only in English, some efforts have targeted other languages. LiveQA-Med [14] provides a small set of 634 annotated medical question-answer pairs, but its test set (104 questions) is too limited for robust evaluation. MEDQA [15], built from medical board exams in English and Chinese, does not clearly specify the balance between languages or the translation quality. WebMedQA [17], derived from Chinese health consultancy platforms, reflects real-world medical inquiries, though its reliability depends on the moderation of user-generated content.</p>
      <p>Multiple Choice QA. Several datasets focus on multiple-choice QA (MCQA) for biomedical applications. HEAD-QA [19] and MedMCQA [20] assess domain knowledge and reasoning skills but lack coverage for Italian. PubMedQA presents a distinct format where article titles serve as binary-answer questions, though it does not address complex inferential reasoning. While ChiMed [21] and cMedQA [15] provide Chinese-language biomedical MCQA datasets, Italian biomedical QA resources remain virtually nonexistent. QA4-MRE [22] attempted to introduce multilingual medical reading comprehension, yet its dataset was limited in both scale and scope. To address this gap, we introduce a large-scale Italian biomedical QA dataset, consisting of 782,644 question-answer pairs spanning 77 medical categories, alongside an Italian biomedical MCQA dataset with 25,862 multiple-choice questions across 60 categories. Compared to existing datasets, our corpus is significantly larger and more diverse, enhancing both domain-specific knowledge extraction and OpenQA capabilities. Furthermore, we employ advanced post-processing techniques to improve answer accuracy and applicability in medical information retrieval tasks.</p>
      <p>Example from IMB-QA (Table 2). ID: 14571. URL: https://www.medicitalia.it/consulti/psicoterapia/23549-psichiatri-per-fare-psicoterapianecessitano-di-specializzazione.html</p>
      <p>Question. ITA: Psichiatri: per fare psicoterapia necessitano di specializzazione? Gentili dottori, uno psichiatra è automaticamente abilitato a fare lo psicoterapeuta, o deve, come gli psicologi che intendono anche essere psicoterapeuti, fare i 4 anni di specializzazione in psicoterapia? Grazie.</p>
      <p>ENG: Psychiatrists: do they need specialization to practice psychotherapy? Dear doctors, is a psychiatrist automatically allowed to practice psychotherapy, or does he need, like psychologists who also wish to be psychotherapists, to undergo 4 years of specialization in psychotherapy? Thank you.</p>
      <p>Reformulated Answer. ITA: Dopo la specializzazione, uno psichiatra può iscriversi all'albo degli psicoterapeuti senza dover completare un'ulteriore specializzazione. Sebbene alcuni psichiatri scelgano di proseguire con studi supplementari in psicoterapia, tale formazione non è obbligatoria. Gli psicologi, invece, devono frequentare un corso di specializzazione per diventare psicoterapeuti.</p>
      <p>ENG: After specialization, a psychiatrist can register with the list of psychotherapists without needing further specialization. While some psychiatrists choose to pursue additional studies in psychotherapy, this training is not mandatory. Psychologists, however, must attend a specialization course to become psychotherapists.</p>
      <p>Category: ITA: Psicoterapia - ENG: Psychotherapy</p>
      <p>Macro Category: ITA: Salute Mentale - ENG: Mental Health</p>
    </sec>
    <sec id="sec-4">
      <title>3. IMB Dataset</title>
      <p>The IMB dataset consists of two structured subsets: IMB-QA, which focuses on unstructured, patient-driven medical inquiries and professional responses, and IMB-MCQA, which contains structured multiple-choice questions designed for evaluating domain-specific medical knowledge. The IMB-QA dataset captures natural, patient-driven inquiries and professional responses, reflecting real-world medical concerns and interactions (refer to Table 2 for an example).</p>
      <p>In contrast, the IMB-MCQA dataset consists of structured multiple-choice questions derived from medical specialization exam simulators, providing a controlled environment for evaluating domain-specific knowledge (an example is shown in Table 3).</p>
      <sec id="sec-4-2">
        <p>The IMB-QA dataset was constructed by collecting questions and answers from two Italian medical forums: MedicItalia and Dica33. These public platforms facilitate interactions between users and certified healthcare professionals. The selection of these forums was guided by qualitative reliability criteria, including verification of medical credentials and assessment of response quality. The data extraction process was conducted through automated retrieval of publicly available information. To enhance compliance with GDPR requirements, an anonymization procedure was applied to remove Personally Identifiable Information (PII). However, we acknowledge that ensuring complete anonymization is inherently challenging, especially in medical contexts where indirect re-identification risks may persist. Future iterations of the dataset will incorporate additional validation steps to assess and improve the effectiveness of the anonymization process. The dataset covers a broad spectrum of common clinical conditions, supporting its medical representativeness. Each sample consists of the following components: a question formulated by a user, representing a real medical concern and assigned to a specific medical category; an answer provided by a certified healthcare professional, reformulated when necessary to improve clarity and coherence while ensuring the anonymization of personal data; and additional metadata, including the corresponding medical category, the macro-category and, where applicable, the URL of the original source.</p>
        <p>The IMB-MCQA dataset, on the other hand, was constructed by collecting multiple-choice questions from the Italian medical specialization exam simulator CompitoInClasse.org. Each sample consists of the following components: a question related to a specific clinical topic, selected from official simulators that provide access to past examination questions; the multiple-choice answers associated with the question, including one correct answer validated by domain experts; the medical category of the question, identifying the relevant medical field (e.g., physiology, cardiology); and the percentage of correct answers, calculated from the responses of a substantial number of candidates who have used the simulator, with a minimum response threshold to ensure reliability.</p>
        <sec id="sec-4-2-1">
          <title>3.2. Data preprocessing methods</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <p>The IMB-QA dataset was built from Italian medical forums, collecting 782,644 patient questions and certified professional answers across 77 categories (up to July 2024), capturing real-world interactions.</p>
        <p>The IMB-MCQA dataset was compiled from official Italian medical specialization exams through 2024 and includes 25,862 multiple-choice questions across 60 clinical fields, each with 4–5 options. As is typical of unstructured sources, both datasets contained inconsistencies, redundancies, and PII. A multi-stage preprocessing pipeline improved their quality and NLP usability. Summary statistics are reported in Table 4.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.2.1. Preprocessing for IMB-QA</title>
        <p>Data cleaning. Incomplete or truncated questions were removed, doctor signatures and timestamps were stripped, and minor inconsistencies were fixed while preserving meaning.</p>
        <p>Text Normalization, Answer Reformulation, and Data Anonymization. These operations were carried out using Llama3-Med42-8B [25], a Large Language Model (LLM) specialized in the medical domain and adapted for multilingual tasks. The model underwent a prompt engineering phase to enhance the clarity, coherence, and grammatical accuracy of the responses while preserving an adequate level of fidelity to medical information. User-submitted questions were retained in their original form to preserve the natural variability and authenticity of real-world patient inputs. In contrast, doctors' responses were reformulated according to three main criteria: (i) removal of redundancies and colloquial language, (ii) stylistic consistency across responses, and (iii) improved readability for more effective processing by NLP models. To address anonymization, we utilized Italian_NER_XXL [26], a NER model specifically trained on Italian. This model successfully identified PII, such as names of patients and doctors, cities, online resources, email addresses, healthcare facilities, and other identifiers that could enable re-identification. The identified PII underwent an anonymization procedure using the same LLM employed for reformulation, which preserved sentence semantics while substituting terms with generic, medical context-appropriate alternatives. The effectiveness of anonymization was evaluated by calculating the percentage of PII (detected using the same NER model as in the anonymization phase) in the initial, reformulated, and anonymized responses on a subset of approximately 2,163 responses selected equally from all medical categories in the dataset. Initially, 27% of answers contained PII; reformulation reduced this to 7%, and ultimately, anonymization decreased the presence of PII to just 1%.</p>
        <p>Data Categorization. To group questions into broader semantic fields, unsupervised topic modeling via BERTopic [27] was applied. Sentence embeddings were generated with "paraphrase-multilingual-MiniLM-L12-v2" [28], reduced via UMAP [29], and clustered using HDBSCAN [30]. This enabled flexible, interpretable macro-categorization without enforcing rigid class definitions. Final groupings are reported in Table 5.</p>
        <sec id="sec-4-4-2">
          <title>3.2.2. Preprocessing for IMB-MCQA</title>
          <p>As this dataset was already in a clean, structured exam format, preprocessing mainly involved organizing entries and ensuring consistent formatting. No major cleaning or reformulation was necessary. The workflow is summarized in Figure 1.</p>
        </sec>
        <sec id="sec-4-4-1">
          <title>3.3. Data Analysis</title>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>3.3.1. Diversity of Questions</title>
        <p>Clinical medicine covers a broad range of topics, reflected in the question types within the IMB dataset. To assess this variety, a qualitative analysis was conducted on a random sample of 102 questions from IMB-QA and IMB-MCQA. Given the complexity of accurately classifying questions as fact-based or case-based through automated methods, manual categorization was chosen. Fact-based questions focus on specific medical knowledge and clear reasoning, such as "Which condition is linked to persistent fatigue?". Case-based questions, instead, present a patient's symptoms or medical background, requiring multi-step reasoning for diagnosis, treatment decisions, or prognosis, such as assessing a patient with chest pain.</p>
        <p>Figure 2: Percentage of questions with above-average difficulty (%) by macro-category in IMB-QA. The score refers to the percentage of questions in each category that were classified as above-average in difficulty, based on our difficulty index.</p>
      </sec>
      <sec id="sec-4-6">
        <p>The analysis indicates that IMB-QA is predominantly composed of case-based questions, where patients describe symptoms and seek medical guidance, requiring models to perform complex reasoning. Although IMB-MCQA mainly consists of fact-based questions, as it evaluates medical knowledge for specialization exams, it also includes a considerable number of case-based inquiries. This dual function highlights the datasets' role in assessing both factual knowledge and clinical decision-making, with IMB-QA emphasizing patient narratives and IMB-MCQA blending factual recall with clinical reasoning.</p>
      </sec>
      <sec id="sec-4-7">
        <title>3.3.2. Need for Domain-Specific Expertise</title>
        <p>To evaluate the datasets' complexity, we assessed question difficulty. In IMB-QA, a sample of 2,500 questions was analyzed using a difficulty index based on length, terminology, and syntax. 39.24% of questions were above-average in difficulty, with Neurology exceeding 70%, indicating high specialization demands (Figure 2).</p>
        <p>In IMB-MCQA, difficulty was estimated from participant accuracy. Categories like "Thermal Medicine" (80.12%), "Ophthalmology" (72.86%), "Neurosurgery" (71.30%), and "Nuclear Medicine" (66.95%) showed high complexity (Figure 3). These results confirm that both datasets require advanced clinical knowledge, making them valuable for training models in specialized medical reasoning.</p>
      </sec>
      <sec id="sec-4-8">
        <title>3.3.3. Diversity of Categories</title>
        <p>The IMB dataset shows uneven category distribution, affecting model performance across specialties. IMB-QA (Figure 4) overrepresents areas like "Gastroenterology", "Cardiology", and "Urology", while fields like "Sleep Medicine" and "Pediatric Surgery" are underrepresented. This may lead to imbalanced model capabilities. IMB-MCQA (Figure 5) shows a more uniform distribution, with most categories having ∼350 questions, except "General Medicine" (∼5,000), reducing but not eliminating coverage gaps in niche fields.</p>
      </sec>
      <sec id="sec-4-9">
        <title>3.3.4. Presence of Information Noise and Ambiguity in Responses</title>
        <p>Challenges in the IMB dataset include noise and ambiguity. In IMB-QA, informal forum responses often contain contextual or generic advice, sometimes prioritizing in-person consultation over definitive answers. These traits, while realistic, introduce variability. Preprocessing helped filter irrelevant elements and standardize responses. In IMB-MCQA, ambiguity stems from distractors designed to assess reasoning, with some questions allowing multiple valid interpretations. Such complexity enhances the dataset's value in training models to manage uncertainty and emulate clinical decision-making.</p>
      </sec>
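The difficulty index is described as combining question length, terminology, and syntax, but its exact weighting is not given in the text. The following is therefore only a hypothetical sketch of such an index: the term list, weights, and example questions are invented.

```python
import statistics

# Invented term list and weights: the paper's exact index is not specified.
MEDICAL_TERMS = {"tachicardia", "dispnea", "ipertensione", "anamnesi"}

def difficulty(question: str) -> float:
    """Toy index combining length, medical terminology, and a crude syntax proxy."""
    words = [w.strip(",?.") for w in question.lower().split()]
    term_hits = sum(w in MEDICAL_TERMS for w in words)
    clauses = question.count(",") + 1  # comma count as a proxy for syntactic complexity
    return 0.1 * len(words) + 1.0 * term_hits + 0.5 * clauses

questions = [
    "Ho mal di testa, cosa posso fare?",
    "Soffro di tachicardia e dispnea, devo preoccuparmi?",
    "Quanto dura un raffreddore?",
]
scores = [difficulty(q) for q in questions]
mean_score = statistics.mean(scores)
above_average = 100.0 * sum(s > mean_score for s in scores) / len(scores)
print(round(above_average, 2))  # 33.33
```

With the per-question scores in hand, the reported statistic (e.g., 39.24% of IMB-QA questions above average) is simply the share of scores exceeding the sample mean.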
    </sec>
    <sec id="sec-5">
      <title>4. Applications</title>
      <sec id="sec-5-2">
        <title>4.1. Benchmarking Large Language Models</title>
        <p>Evaluating LLMs on domain-specific datasets is essential to measure their suitability for fields like medicine, where precise understanding is required [31]. Despite advancements in general-purpose knowledge, performance in non-English clinical contexts remains limited [32]. IMB-QA and IMB-MCQA enable benchmarking in Italian for both open-ended and multiple-choice medical QA, capturing language-specific features, technical terminology, and clinical nuances.</p>
        <p>We evaluate open-ended QA using BERTScore [33] with the multilingual model bert-base-multilingual-cased, chosen for its cross-lingual semantic similarity capabilities and its widespread adoption in multilingual NLP benchmarks. For MCQA tasks, we report standard accuracy. This dual evaluation highlights LLM strengths and limitations in Italian clinical applications.</p>
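The MCQA metric reduces to exact match over the selected options; a minimal sketch (function and variable names are ours, and the example answers are invented):

```python
def mcqa_accuracy(predictions, gold_answers):
    """Exact-match accuracy over multiple-choice selections."""
    if len(predictions) != len(gold_answers):
        raise ValueError("prediction/gold length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

print(mcqa_accuracy(["B", "C", "A", "D"], ["B", "C", "B", "D"]))  # 0.75
```

BERTScore, by contrast, requires the `bert-score` package and a reference model, so it is not reproduced here.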
      </sec>
      <sec id="sec-5-3">
        <title>4.2. Medical Question Answering</title>
        <p>Medical QA demands models that handle informal, complex queries without hallucinating [34, 35]. We apply Retrieval-Augmented Generation (RAG) using a separate knowledge base of 100k anonymized IMB-QA answers, explicitly excluding evaluation samples to avoid data leakage. Relevant contexts are retrieved via dense embeddings generated with all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and indexed using FAISS [36]. We retrieve the top-5 most similar passages, which are then prepended to the query. This ensures factual grounding while maintaining separation between retrieved context and target answers. Although we did not perform a separate retriever evaluation, the overall gain in BERTScore (Table 7) confirms the added value of retrieval. The process is formalized as:</p>
        <p>a = LLM(q, R(q, D)) (1)</p>
        <p>where q is the query, D the dataset, R the retrieval function, and a the generated answer. Table 7 shows RAG improves BERTScore Precision across all categories.</p>
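The retrieval step, R(q, D) followed by prepending the retrieved context to the query, can be illustrated without the actual FAISS index or sentence-transformer model. In the sketch below, hand-written 2-dimensional vectors stand in for dense embeddings, and all passages are invented:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, index, k=5):
    """Return the k passages most similar to the query embedding.
    Stand-in for the FAISS index over all-MiniLM-L6-v2 embeddings."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:k]]

def build_prompt(query, passages):
    # Retrieved context is prepended to the query before calling the LLM.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hand-written 2-d vectors stand in for dense sentence embeddings.
index = [
    {"text": "Passage about cardiology", "vec": [1.0, 0.0]},
    {"text": "Passage about dermatology", "vec": [0.0, 1.0]},
    {"text": "Passage about hypertension", "vec": [0.9, 0.1]},
]
top = retrieve([1.0, 0.05], index, k=2)
print(top)  # ['Passage about cardiology', 'Passage about hypertension']
```

In the actual pipeline, k=5 and the prompt is passed to the evaluated LLM; the toy cosine search here plays the role of the FAISS similarity lookup.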
      </sec>
      <sec id="sec-5-4">
        <title>4.3. Fine-tuning</title>
        <p>Fine-tuning improves domain alignment for LLMs, especially in non-English medical contexts [<xref ref-type="bibr" rid="ref10">37, 38</xref>]. Using IMB-QA, we fine-tune Small Language Models (SLMs) such as Llama-3.2-1B, Gemma-2-2b-it, and Qwen2.5-1.5B [<xref ref-type="bibr" rid="ref11">39</xref>], leveraging [CLS]/[SEP] token strategies, cross-entropy loss, and Curriculum Learning [<xref ref-type="bibr" rid="ref12">40</xref>] via the Unsloth [<xref ref-type="bibr" rid="ref13">41</xref>] library. This approach aims to enhance output accuracy and reduce hallucinations while ensuring efficient deployment in clinical environments. Although formal hallucination metrics are not reported, the results in Table 8 show that fine-tuning on IMB-QA leads to modest improvements across several metrics, particularly in BERTScore and BLEU. Gains are model-dependent and not uniform across all scores: for instance, METEOR slightly decreases in some cases. Nonetheless, the overall trend supports the effectiveness of task-specific adaptation in improving answer quality in Italian medical QA.</p>
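Curriculum Learning orders training examples from easy to hard before scheduling epochs. The sketch below illustrates only that ordering step; the actual pipeline uses the Unsloth library and LLM tokenizers, and question length here is an invented difficulty proxy:

```python
def curriculum_order(samples, difficulty_fn):
    """Order training samples from easy to hard before scheduling epochs."""
    return sorted(samples, key=difficulty_fn)

# Question length is an invented difficulty proxy for illustration.
batch = [
    "Paziente con dolore toracico irradiato al braccio sinistro da due ore",
    "Mal di testa",
    "Febbre e tosse da tre giorni",
]
ordered = curriculum_order(batch, difficulty_fn=len)
print(ordered[0])  # Mal di testa
```

Any scalar difficulty estimate (such as the index from Section 3.3.2) can be plugged in as `difficulty_fn`.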
      </sec>
      <sec id="sec-5-5">
        <title>5. Experiments</title>
        <sec id="sec-5-5-1">
          <title>5.1. Experimental Setup</title>
          <p>Experiments were conducted on Google Colab Pro using an NVIDIA T4 GPU and an Intel Xeon CPU. Due to hardware constraints, the evaluation focused on the most complex categories, as defined in Section 3.3.2. For IMB-QA, ∼2,000 instances were sampled per category, except for the "Neurology" category, which includes only 998 instances. In the case of IMB-MCQA, the full set of instances was used.</p>
        </sec>
        <sec id="sec-5-5-2">
          <title>5.2. MCQA Results</title>
          <p>IMB-MCQA offers a robust benchmark for clinical QA in multiple-choice format, evaluated using accuracy. As shown in Figure 6, models with more than 8B parameters achieve nearly 85% accuracy, outperforming smaller models, which struggle with domain-specific reasoning.</p>
          <p>These trends align with prior analyses of category difficulty, where questions involving underrepresented or cognitively complex fields proved more challenging even for advanced LLMs.</p>
        </sec>
        <sec id="sec-5-5-3">
          <title>5.3. Medical QA Results</title>
          <p>IMB-QA allows assessment of open-ended medical QA, where semantic accuracy is paramount. In Figure 7, gemma-2-9b-it outperforms larger models, likely due to its multilingual training. Despite its smaller size, it achieves competitive BERTScore Precision (up to 0.638), suggesting high semantic alignment. This metric is more informative than fluency-based ones in clinical settings, where accurate, relevant answers are crucial.</p>
        </sec>
      </sec>
      <sec id="sec-5-6">
        <title>5.4. Fine-tuning SLMs Results</title>
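<p>The 80/20 train/eval protocol can be sketched as follows (a minimal, self-contained illustration: the record fields and the fixed seed are hypothetical, and the fine-tuning itself relied on the Unsloth library):</p>

```python
import random

def train_eval_split(records, eval_fraction=0.2, seed=42):
    """Deterministically shuffle the records and split them into train/eval partitions."""
    rng = random.Random(seed)          # fixed seed keeps the split reproducible
    shuffled = records[:]              # copy, so the caller's list is not mutated
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]   # (train, eval)

# Example: splitting 2,000 sampled QA instances (field names are illustrative)
instances = [{"id": i, "question": f"Q{i}", "answer": f"A{i}"} for i in range(2000)]
train_set, eval_set = train_eval_split(instances)
print(len(train_set), len(eval_set))  # → 1600 400
```

<p>Holding the seed fixed keeps the evaluation partition identical across runs, so per-metric comparisons between base and fine-tuned models are computed on the same instances.</p>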
        <p>[Figure: BERTScore Precision by medical category for the ten evaluated models. Per-category instance counts range from n=998 ("Neurology") to n=2001 ("Gynecology and female health"); scores range from 0.583 ("Cardiology, circulatory system and hematology") to 0.653 ("Neurology"), with "Neurology" scoring highest across models. Color scale: 0.2-0.8.]</p>
        <p>
Some models showed performance drops in specific scores such as METEOR. This confirms that task adaptation improves answer quality and contextual understanding, even for compact models, making them well-suited for clinical applications.</p>
      </sec>
      <sec id="sec-conclusion">
        <title>6. Conclusion &amp; Future Work</title>
        <p>In this work, we introduced IMB, the first Italian dataset for medical question-answering, which includes both open-ended (QA) and multiple-choice (MCQA) questions. The dataset, sourced from medical forums and exam simulators, provides a valuable resource for the development of advanced NLP models. Our qualitative and quantitative analysis highlighted a diverse range of medical specialties, while also revealing challenges related to question difficulty and clinical complexity. Initial experiments with state-of-the-art language models demonstrated that these models struggle with clinically complex Italian questions but perform relatively well on multiple-choice questions. Future work will focus on expanding the dataset by incorporating additional medical specialties and languages (such as English), improving category balancing, and implementing advanced filtering techniques to reduce informational noise. Furthermore, we will explore strategies for adapting language models to improve their ability to understand and reason effectively about medical content.</p>
        <p>Limitations. IMB has several limitations, including an imbalance in specialty representation. Fields such as "Gastroenterology" and "Cardiology" are overrepresented, while others, such as "Sleep Medicine" and "Pediatric Surgery", have limited coverage. This imbalance may affect model generalization. We will address this issue through data balancing techniques, such as oversampling and weighted training strategies. Another limitation arises from informational noise, as the questions were automatically collected from public sources, which may include irrelevant or ambiguous details. We plan to tackle this challenge by employing semantic filtering and human verification methods. Additionally, ambiguity in responses, particularly in the IMB-MCQA dataset, poses a challenge, which we aim to overcome through disambiguation techniques and more precise annotation strategies.</p>
        <p>Ethical and Legal Considerations. Our dataset has been developed using content sourced from publicly accessible Italian medical sites (MedicItalia, Dica33) as well as a medical exam simulator (CompitoInClasse.org). The dataset is intended exclusively for academic research with non-commercial objectives, adhering to legal guidelines regarding GDPR compliance, data anonymization, and research-related copyright exemptions as outlined in Italian and EU legislation. To mitigate any legal and ethical challenges, and based on consultations with legal experts, we implemented several measures: (1) Anonymization: all identifying details (e.g. names, contact details, emails) were removed or altered with the help of automated scripts and LLM-supported redaction, conforming to GDPR's tenets of data minimization and protection. (2) Textual Transformation: while we provide links to the original source of each data sample, the raw questions and answers were linguistically restructured and polished, involving grammatical adjustments, simplification, and content refinement with the aid of LLMs and manual oversight. (3) Scientific Scope: this data serves strictly educational, illustrative, and scientific purposes as permitted under Article 89 of the GDPR and Article 70 of the Italian Copyright Law, which allows non-commercial research data usage under specified conditions. For this reason, the dataset is distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 (CC BY-NC-ND 4.0) license. This license strictly restricts usage to non-commercial research, prohibits redistribution of altered versions, and mandates proper author attribution.</p>
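<p>The anonymization measure can be illustrated with a small rule-based sketch; the patterns, placeholder tags, and sample text below are illustrative assumptions rather than the actual redaction scripts, which also relied on LLM-supported review:</p>

```python
import re

# Illustrative redaction rules (hypothetical, not the scripts actually used):
# each pattern is replaced by a neutral placeholder tag.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),           # e-mail addresses
    (re.compile(r"\+?\d[\d ./-]{7,}\d"), "[PHONE]"),               # phone-like digit runs
    (re.compile(r"\b(?:Dott\.ssa|Dott\.|Dr\.|Sig\.ra|Sig\.)\s+[A-Z]\w+"),
     "[NAME]"),                                                    # Italian honorific + surname
]

def redact(text):
    """Apply each rule in turn, replacing matches with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

sample = "Salve, sono il Dott. Rossi, contattatemi a mario.rossi@example.it o al +39 081 1234567."
print(redact(sample))  # → Salve, sono il [NAME], contattatemi a [EMAIL] o al [PHONE].
```

<p>Fixed rules of this kind inevitably miss free-text identifiers, which is why the pipeline described above complements them with LLM-supported redaction and manual oversight.</p>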
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
          <p>This work was conducted with the financial support of (1) the PNRR MUR project PE0000013-FAIR and (2) the Italian Ministry of Economic Development, via the ICARUS (Intelligent Contract Automation for Rethinking User Services) project (CUP: B69J23000270005).</p>
          <p>[10] M. Zhu, A. Ahuja, W. Wei, C. K. Reddy, A hierarchical attention retrieval model for healthcare question answering, in: The World Wide Web Conference, WWW '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2472–2482. URL: https://doi.org/10.1145/3308558.3313699. doi:10.1145/3308558.3313699.</p>
          <p>[11] M. A. Weinzierl, S. M. Harabagiu, The University of Texas at Dallas HLTRI's participation in EPIC-QA: searching for entailed questions revealing novel answer nuggets, CoRR abs/2112.13946 (2021). URL: https://arxiv.org/abs/2112.13946. arXiv:2112.13946.</p>
          <p>[12] T. Möller, A. Reina, R. Jayakumar, M. Pietsch, COVID-QA: A question answering dataset for COVID-19, in: ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID), ACL, Online, 2020. URL: https://openreview.net/forum?id=JENSKEEzsoU.</p>
          <p>[13] S. Šuster, W. Daelemans, CliCR: a dataset of clinical case reports for machine reading comprehension, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1551–1563. URL: https://aclanthology.org/N18-1140/. doi:10.18653/v1/N18-1140.</p>
          <p>[14] A. B. Abacha, E. Agichtein, Y. Pinter, D. Demner-Fushman, Overview of the medical question answering task at TREC 2017 LiveQA, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, volume 500-324 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2017. URL: https://trec.nist.gov/pubs/trec26/papers/Overview-QA.pdf.</p>
          <p>[15] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, P. Szolovits, What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020. URL: https://arxiv.org/abs/2009.13081. arXiv:2009.13081.</p>
          <p>[16] A. Pampari, P. Raghavan, J. J. Liang, J. Peng, emrQA: A large corpus for question answering on electronic medical records, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Association for Computational Linguistics, 2018, pp. 2357–2368. URL: https://doi.org/10.18653/v1/d18-1258. doi:10.18653/V1/D18-1258.</p>
          <p>[17] J. He, M. Fu, M. Tu, Applying deep matching networks to Chinese medical question answering: a study and a dataset, BMC Medical Informatics Decis. Mak. 19-S (2019) 91–100. URL: https://doi.org/10.1186/s12911-019-0761-8. doi:10.1186/S12911-019-0761-8.</p>
          <p>[18] A. Nentidis, A. Krithara, K. Bougiatiotis, M. Krallinger, C. R. Penagos, M. Villegas, G. Paliouras, Overview of BioASQ 2020: The eighth BioASQ challenge on large-scale biomedical semantic indexing and question answering, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings, volume 12260 of Lecture Notes in Computer Science, Springer, 2020, pp. 194–214. URL: https://doi.org/10.1007/978-3-030-58219-7_16. doi:10.1007/978-3-030-58219-7_16.</p>
          <p>[19] D. Vilares, C. Gómez-Rodríguez, HEAD-QA: A healthcare dataset for complex reasoning, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 960–966. URL: https://aclanthology.org/P19-1092/. doi:10.18653/v1/P19-1092.</p>
          <p>[20] A. Pal, L. K. Umapathi, M. Sankarasubbu, MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering, in: G. Flores, G. H. Chen, T. J. Pollard, J. C. Ho, T. Naumann (Eds.), Conference on Health, Inference, and Learning, CHIL 2022, 7-8 April 2022, Virtual Event, volume 174 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 248–260. URL: https://proceedings.mlr.press/v174/pal22a.html.</p>
          <p>[21] Y. Tian, W. Ma, F. Xia, Y. Song, ChiMed: A Chinese medical corpus for question answering, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 250–260. URL: https://aclanthology.org/W19-5027/. doi:10.18653/v1/W19-5027.</p>
          <p>[22] A. Peñas, E. H. Hovy, P. Forner, Á. Rodrigo, R. F. E. Sutcliffe, R. Morante, QA4MRE 2011-2013: Overview of question answering for machine reading evaluation, in: P. Forner, H. Müller, R. Paredes, P. Rosso, B. Stein (Eds.), Information Access Evaluation. Multilinguality, Multimodality, and Visualization - 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, September 23-26, 2013, Proceedings, volume 8138 of Lecture Notes in Computer Science, Springer, 2013, pp. 303–320. URL: https://doi.org/10.1007/978-3-642-40802-1_29. doi:10.1007/978-3-642-40802-1_29.</p>
          <p>[23] D. Pappas, I. Androutsopoulos, H. Papageorgiou, BioRead: A new dataset for biomedical reading comprehension, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA), 2018. URL: http://www.lrec-conf.org/proceedings/lrec2018/summaries/795.html.</p>
          <p>[24] D. Pappas, P. Stavropoulos, I. Androutsopoulos, R. McDonald, BioMRC: A dataset for biomedical machine reading comprehension, in: D. Demner-Fushman, K. B. Cohen, S. Ananiadou, J. Tsujii (Eds.), Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Association for Computational Linguistics, Online, 2020, pp. 140–149. URL: https://aclanthology.org/2020.bionlp-1.15/. doi:10.18653/v1/2020.bionlp-1.15.</p>
          <p>[25] C. Christophe, P. K. Kanithi, T. Raha, S. Khan, M. A. Pimentel, Med42-v2: A suite of clinical LLMs, CoRR abs/2408.06142 (2024). URL: https://doi.org/10.48550/arXiv.2408.06142. doi:10.48550/ARXIV.2408.06142. arXiv:2408.06142.</p>
          <p>[26] DeepMount00, Italian_NER_XXL, https://huggingface.co/DeepMount00/Italian_NER_XXL, 2024.</p>
          <p>[27] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, CoRR abs/2203.05794 (2022). URL: https://doi.org/10.48550/arXiv.2203.05794. doi:10.48550/ARXIV.2203.05794. arXiv:2203.05794.</p>
          <p>[28] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/V1/D19-1410.</p>
          <p>[29] L. McInnes, J. Healy, UMAP: uniform manifold approximation and projection for dimension reduction, CoRR abs/1802.03426 (2018). URL: http://arxiv.org/abs/1802.03426. arXiv:1802.03426.</p>
          <p>[30] M. F. Rahman, W. Liu, S. B. Suhaim, S. Thirumuruganathan, N. Zhang, G. Das, HDBSCAN: density based clustering over location based services, CoRR abs/1602.03730 (2016). URL: http://arxiv.org/abs/1602.03730. arXiv:1602.03730.</p>
          <p>[31] J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, M. L. Li, Benchmarking large language models on CMExam - a comprehensive Chinese medical exam dataset, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/a48ad12d588c597f4725a8b84af647b5-Abstract-Datasets_and_Benchmarks.html.</p>
          <p>[32] Y. Jin, M. Chandra, G. Verma, Y. Hu, M. D. Choudhury, S. Kumar, Better to ask in English: Cross-lingual evaluation of large language models for healthcare queries, in: T. Chua, C. Ngo, R. Kumar, H. W. Lauw, R. K. Lee (Eds.), Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, ACM, 2024, pp. 2627–2638. URL: https://doi.org/10.1145/3589334.3645643. doi:10.1145/3589334.3645643.</p>
          <p>[33] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.</p>
          <p>[34] S. Zhang, X. Zhang, H. Wang, J. Cheng, P. Li, Z. Ding, Chinese medical question answer matching using end-to-end character-level multi-scale CNNs, Applied Sciences 7 (2017) 767.</p>
          <p>[35] N. Yagnik, J. Jhaveri, V. Sharma, G. Pila, A. Ben, J. Shang, MedLM: Exploring language models for medical question answering systems, CoRR abs/2401.11389 (2024). URL: https://doi.org/10.48550/arXiv.2401.11389. doi:10.48550/ARXIV.2401.11389. arXiv:2401.11389.</p>
          <p>[36] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, H. Jégou, The Faiss library, arXiv preprint arXiv:2401.08281 (2024). arXiv:2401.08281.</p>
          <p>[37] H. Tran, Z. Yang, Z. Yao, H. Yu, BioInstruct: instruction tuning of large language models for biomedical natural language processing, J. Am. Medical Informatics Assoc. 31 (2024) 1821–1832. URL: https://doi.org/10.1093/jamia/ocae122. doi:10.1093/JAMIA/OCAE122.</p>
<p>Declaration on Generative AI. During the preparation of this work, the author(s) used ChatGPT (OpenAI) and DeepL Write / DeepL Translate for grammar and spelling checking. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for Computational Linguistics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/. doi:
          <volume>10</volume>
          .18653/ v1/
          <fpage>N19</fpage>
          - 1423.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/
          <year>1907</year>
          .11692 (
          <year>2019</year>
          )
          <article-title>-</article-title>
          . URL: http: //arxiv.org/abs/
          <year>1907</year>
          .11692. arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinform</source>
          .
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . URL: https://doi.org/10.1093/bioinformatics/btz682. doi:
          <volume>10</volume>
          .1093/BIOINFORMATICS/BTZ682.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          , P. Liang, SQuAD:
          <volume>100</volume>
          ,000+
          <article-title>questions for machine comprehension of text</article-title>
          , in: J.
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Duh</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          Carreras (Eds.),
          <source>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Austin, Texas,
          <year>2016</year>
          , pp.
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          . URL: https://aclanthology.org/ D16-1264/. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D16</fpage>
          - 1264.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. G. Carbonell, R. Salakhutdinov,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Xlnet: Generalized autoregressive pretraining for language understanding</article-title>
          , in: H.
          <string-name>
            <surname>M. Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Beygelzimer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>d'Alché-</article-title>
          <string-name>
            <surname>Buc</surname>
            ,
            <given-names>E. B.</given-names>
          </string-name>
          <string-name>
            <surname>Fox</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems</source>
          <year>2019</year>
          ,
          <article-title>NeurIPS 2019</article-title>
          , December 8-
          <issue>14</issue>
          ,
          <year>2019</year>
          , Vancouver, BC, Canada, NeurIPS, Vancouver, BC, Canada,
          <year>2019</year>
          , pp.
          <fpage>5754</fpage>
          -
          <lpage>5764</lpage>
          . URL: https://proceedings.neurips.cc/paper/2019/hash/ dc6a7e655d7e5840e66733e9ee67cc69-Abstract. html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Redfield</surname>
          </string-name>
          , M. Collins,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelcey</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <article-title>Natural questions: A benchmark for question answering research, Transactions of the Association for Computational Linguistics 7 (</article-title>
          <year>2019</year>
          )
          <fpage>452</fpage>
          -
          <lpage>466</lpage>
          . URL: https://aclanthology.org/Q19-1026/. doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00276</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , G. Balikas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , I. Partalas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zschunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almirantis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Baskiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gallinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Artières</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heino</surname>
          </string-name>
          , É. Gaussier,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrio-Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , I. Androutsopoulos,
          <string-name>
            <surname>G. Paliouras,</surname>
          </string-name>
          <article-title>An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <volume>138</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>138</lpage>
          :
          <fpage>28</fpage>
          . URL: https:// doi.org/10.1186/s12859-015-0564-6. doi:
          <volume>10</volume>
          .1186/ S12859- 015- 0564- 6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Retrieve what you need: A mutual learning framework for open-domain question answering</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>247</fpage>
          -
          <lpage>263</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00646. doi:10.1162/TACL_A_00646.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamurias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sousa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Couto</surname>
          </string-name>
          , Generat… URL: https://doi.org/10.1093/jamia/ocae122. doi:10.1093/JAMIA/OCAE122.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Dorfner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Busch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Makowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Truhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleesiek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sushil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lammert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Bressem</surname>
          </string-name>
          ,
          <article-title>Biomedical large languages models seem not to be superior to generalist models on unseen medical data</article-title>
          ,
          <source>CoRR abs/2408.13833</source>
          (
          <year>2024</year>
          )
          . URL: https://doi.org/10.48550/arXiv.2408.13833. doi:10.48550/ARXIV.2408.13833. arXiv:2408.13833.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aponte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kunapuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barrow</surname>
          </string-name>
          , et al.,
          <article-title>A survey of small language models</article-title>
          ,
          <source>arXiv preprint arXiv:2410.20011</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          in:
          <source>Proceedings of the 26th Annual International Conference on Machine Learning</source>
          , ICML '09, Association for Computing Machinery, New York, NY, USA,
          <year>2009</year>
          , p.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          . URL: https://doi.org/10.1145/1553374.1553380. doi:10.1145/1553374.1553380.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Han</surname>
          </string-name>
          , Unsloth team, Unsloth,
          <year>2023</year>
          . URL: http://github.com/unslothai/unsloth.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>