DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge

Raffaele Manna1,†, Maria Pia di Buono1,*,† and Luca Giordano1,†

1 University of Naples "L'Orientale", via Duomo 219, 80139 Napoli, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
Email: raffaele.manna@unior.it (R. Manna); mpdibuono@unior.it (M. P. d. Buono); giordanoluca.uni@gmail.com (L. Giordano)
ORCID: 0009-0006-6285-8557 (R. Manna); 0009-0009-9372-3323 (M. P. d. Buono); 0009-0002-3048-4408 (L. Giordano)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract

Patients' knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. DIMMI (Drug InforMation Mining in Italian) is a challenge aiming at evaluating the proficiency of Large Language Models in extracting drug-specific information from PILs. The challenge seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, and to enhance drug information extraction and pharmacovigilance efforts. Participants are provided with a dataset of 600 Italian PILs and the objective is to develop models capable of accurately answering specific questions related to drug dosage, usage, side effects, and drug-drug interactions. The challenge should be approached as an information extraction task either in zero-shot mode, based purely on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category, the GS contains the correct information extracted from the leaflets through manual annotation.

Keywords

Patient information leaflets, Information extraction, Large Language Models, Italian

1. Introduction and Motivation

Patients' knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. Nevertheless, this information has been described as often inconsistent, incomplete, and difficult for patients to read and understand [1]. Although in 2009 the European Commission issued guidelines (Guideline on the Readability of the Labelling and Package Leaflet of Medicinal Products for Human Use, European Commission, 2009, https://health.ec.europa.eu/document/download/d8612682-ad17-40e3-8130-23395ec80380_en) recommending the publication of patient information leaflets with accessible and understandable information for patients, several scholars [2, 3, 4] report no improvement in the readability of such documents. Thus, educating patients about their medications seems to be a challenging task due to the linguistic nature of written drug information, which includes a high presence of specialized terms used to describe adverse drug reactions, diseases and other medical concepts that are not easy to understand.

Recently, there has been a growing interest in the utilization of Large Language Models (LLMs) within the medical field to improve various aspects of healthcare, including medical education and clinical decision-making support [5]. Several specialized medical LLMs have been developed through novel pre-training methodologies or enhancements of existing models. Moreover, several evaluation campaigns have been undertaken to evaluate the efficacy of natural language processing models in facilitating knowledge retrieval for clinicians and patients alike. Examples of such campaigns are 1) the Medical Question Answering Task at TREC-2017 LiveQA [6] and subsequent studies [7], which led to two datasets, LiveQA and MedicationQA; 2) the tasks on Medical Consumer Question Answering proposed by Nguyen et al. [8] based on their dataset MedRedQA. Both campaigns have contributed significantly to bridging the gap between consumers' medication questions and trusted answers, and, more generally, to the development of resources tailored to healthcare information retrieval. For a thorough survey of evaluation campaigns on clinical natural language processing refer to Filannino and Uzuner [9].

The application of LLMs as patient assistants to support drug knowledge and ease drug administration seems very attractive; however, it needs to be evaluated carefully due to the presence of model hallucinations, potentially causing medical malpractice [10], as any concealed inaccuracies in diagnoses and health advice could lead to severe outcomes [11]. For these reasons, in the evolving landscape of Artificial Intelligence (AI) applications in medicine, considerations have been raised regarding the regulatory approval of LLMs as medical devices, highlighting the ethical and legal dimensions associated with deploying such technologies in healthcare settings [12].

To delve deeper into this topic, within the CALAMITA campaign [13], we introduce DIMMI (Drug InforMation Mining in Italian), a challenge centered on evaluating the proficiency of LLMs in extracting drug-specific information from Italian PILs. In this way, the task aims at contributing to the development of AI systems for enhancing drug information extraction and pharmacovigilance efforts, specifically for the Italian language.

2. DIMMI

As DIMMI seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, participants are provided with the complete leaflets for each drug and the objective is to develop models capable of accurately answering specific questions related to a drug, such as its dosage, usage, etc.

The challenge should be approached as an information extraction task either in zero-shot mode, based purely on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category (e.g., dosage, usage, side effects, drug-drug interactions), the GS contains the correct information extracted from the leaflets, manually annotated according to the categories described in Section 4.1.

3. Data description

3.1. Origin of data

The challenge dataset is derived from the D-LeafIT Corpus [14], available on GitHub (https://github.com/unior-nlp-research-group/D-LeafIT), made up of 1,819 Italian drug package leaflets, of which 1,439 refer to generic drugs and 380 to class A drugs. The corpus has been created by extracting PILs available on the website of the Italian Medicines Agency (Agenzia Italiana del Farmaco - AIFA, https://www.aifa.gov.it/en/home). In the original corpus, the generic drug leaflets amount to 6,154,007 tokens and the class A leaflets to 1,650,879 tokens, for a total of 7,804,886 tokens. The DIMMI dataset is a subset of 600 entries randomly selected from the D-LeafIT corpus.

It is worth stressing that the information is extracted from PDF files and converted into text, which means that some errors and typos may occur. Furthermore, the original D-LeafIT presents some data noise, e.g., the presence of paratext and wrong encoding from PDF files. To fix these issues, we perform a cleaning procedure as a pre-processing phase to obtain the final dataset. The procedure is mainly automatic and based on recurrent patterns, so some of the aforementioned issues could still be present. The dataset pre-processing phase can be summarized in two main steps:

• Correcting the separation of the leaflets by identifying regular patterns which indicate the beginning/end of a unique leaflet.
• Removing additional information about the issue date, the pharmaceutical company, and the marketing authorization.

Additionally, we notice the presence of several cases of duplicate entries, due to different reasons, as described below:

1. Same drug name, same dosage form, same ingredient amount, different issue dates → These cases indicate that the leaflet has been updated and all the versions are recorded in the AIFA repository. In such cases, on the basis of their ID, the less recent leaflet has been removed.
2. Same drug name, same dosage form, different ingredient amount → These cases may or may not present the same information leaflets. We do not remove the duplicate entries, even though they may present the same information about the classes we are interested in.
3. Same drug name, different dosage form, same ingredient amount → These duplicates are not removed as dosage information can be differentiated on the basis of the drug form.
4. Same drug name, same dosage form, same ingredient amount, different pharmaceutical company → These duplicates are removed and just one entry is kept. We usually prefer keeping the one reporting 'DOC generici' in the name. If this is not possible, we keep the first occurrence.

3.2. Data format

The whole leaflets are provided in the dataset, so that the context is available. Additionally, we provide the drug name for each leaflet. The final dataset, released (https://huggingface.co/datasets/RafaMann/DIMMI) in .tsv (tab-separated values) format, contains four columns. For each entry we present an ID, an ID_Loc which indicates the ID location in the original corpus, the drug name (without any reference to the ingredient amount, the dosage form, and the pharmaceutical company), and the leaflet text (Table 1).

ID    ID_Loc    Drug_Name   Text
119   119_276   BOTAM       BOTAM 0,4 mg capsule (...) Tamsulosina cloridrato Medicinale equivalente (...).

Table 1: Example of a DIMMI entry

Participants in the DIMMI challenge are required to use LLMs to extract the following information from the PIL text: 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect'. This information must be provided as output in a structured format such as TSV or JSON, with reference to each ID and drug name contained in the evaluation dataset.
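As an illustration of the data format described in Section 3.2, the following minimal sketch reads a DIMMI-style TSV and prepares one empty output record per leaflet. It rests on assumptions: the column names are taken from Table 1 and may differ in the released file, the in-memory sample stands in for the real download, and the helper names (`read_leaflets`, `empty_record`) are hypothetical.

```python
import csv
import io
import json

# Hypothetical one-row sample shaped like Table 1; the real file would be
# opened from disk instead of being defined inline.
SAMPLE_TSV = (
    "ID\tID_Loc\tDrug_Name\tText\n"
    "119\t119_276\tBOTAM\tBOTAM 0,4 mg capsule (...) Tamsulosina cloridrato "
    "Medicinale equivalente (...).\n"
)

# The five information classes participants must extract.
CLASSES = ["Molecule", "Usage", "Dosage", "Drug Interaction", "Side Effect"]


def read_leaflets(handle):
    """Parse a tab-separated stream into a list of dicts, one per leaflet."""
    # QUOTE_NONE: quote characters inside leaflet text are ordinary characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def empty_record(row):
    """Build the output skeleton: ID, drug name, and one list of strings per class."""
    record = {"ID": row["ID"], "Drug_Name": row["Drug_Name"]}
    for name in CLASSES:
        record[name] = []  # to be filled with the strings extracted by the model
    return record


rows = read_leaflets(io.StringIO(SAMPLE_TSV))
records = [empty_record(row) for row in rows]
print(json.dumps(records, ensure_ascii=False, indent=2))
```

With the real file, `read_leaflets` could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded .tsv file.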
The information extracted for each ID and drug name with reference to 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect' must be represented in the form of a list of strings (see Section 4.2).

The evaluation dataset for the DIMMI corpus contains columns for the following entity types: 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect'. For each instance (drug leaflet) in the DIMMI corpus, these entity-specific columns are populated with a list of strings, representing the annotated entities of the corresponding type.

The 'Molecule' column will contain a list of the unique molecular entities mentioned in the text, while the 'Usage' column will include a list of the specific uses or indications for the drug. The 'Dosage' column will hold a list of the textual spans describing the dosage, administration, or regimen information. The 'Drug Interaction' column will contain a list of the potential interactions with other drugs, and the 'Side Effect' column will include a list of the adverse effects associated with the drug.

3.3. Prompting

For each drug in the dataset, we evaluate the results from two types of zero-shot prompts in Italian, i.e., specific task-focused prompts and structured prompts.

The former type is composed of five questions, one for each information type we want to extract, as reported below. In the prompt examples, {drug_name} is not a masked word; it is a placeholder for one of the entries in the drug_name column of the DIMMI dataset.

1. Qual è la molecola di {drug_name}? (What is the molecule of {drug_name}?) - to extract the molecule
2. Per cosa si usa {drug_name}? (What is {drug_name} used for?) - to extract the usage
3. Qual è la posologia raccomandata per {drug_name}? (What is the recommended dosage for {drug_name}?) - to extract the dosage
4. Quali sono gli effetti collaterali di {drug_name}? (What are the potential side effects of taking {drug_name}?) - to extract side effects
5. Con quali medicinali interagisce {drug_name}? (Which medicines does {drug_name} interact with?) - to extract the interaction with other drugs

The latter type of prompt aims at extracting all the relevant information with a single instruction that helps the model understand the expected output structure and facilitates extraction, as follows:

• Fornisci le seguenti informazioni su {drug_name}:
  Molecola:
  Uso:
  Posologia:
  Effetti collaterali:
  Interazioni con altri medicinali:
  (Provide the following information about {drug_name}:
  Molecule:
  Usage:
  Dosage:
  Side Effects:
  Drug interaction:)

3.4. Dataset statistics

As mentioned before, the final dataset is composed of 600 unique PILs in Italian, providing a comprehensive dataset for the challenge. The documents in the DIMMI dataset exhibit a wide range of lengths (Table 2), with the shortest document containing 363 tokens and the longest extending to 11,730 tokens. This range in token count directly corresponds to the word count, indicating that each word is treated as a single token in this analysis. On average, each document contains approximately 2,520 words, with a standard deviation of 848 words, indicating moderate variability in document length. The distribution of document lengths is further characterized by the 25th, 50th (median), and 75th percentiles, which are 1,960, 2,448, and 2,980.75 words, respectively.

In total, the corpus contains 1,511,724 words and tokens. The lexical diversity of the corpus is reflected in the 58,901 unique tokens identified, resulting in a type-token ratio (TTR) of 0.0390. This relatively low TTR suggests a high degree of repetition within the text, which is typical for technical and regulatory documents such as drug package leaflets. Importantly, there are no empty documents in the corpus, ensuring that all entries contribute meaningful content to the dataset.

DIMMI Statistics
num_documents        600
mean_length          2519.54
min_length           363
max_length           11730
std_length           848.41
percentiles          .25: 1960, .5: 2448, .75: 2980
total_words          1511724
mean_words_per_doc   2519.54
total_tokens         1511724
min_tokens           363
max_tokens           11730
unique_tokens        58901
type_token_ratio     .038

Table 2: DIMMI statistics
4. Evaluation metrics

We will evaluate the results using accuracy, precision, recall and F1-score, using a gold standard as benchmark (see Section 4.1).

The details for each metric are provided below:

• Precision metric: the share of extracted items that are correct. For example: Dosage: if the model extracts "200mg-400mg every 4-6 hours" and this is correct, the precision for dosage is 100%. Side Effects: if the model extracts "Stomach upset, nausea" and only part of this output is correct, the precision for side effects might be 50% (depending on how many of the extracted side effects are correctly identified).
• Recall metric: the share of gold-standard items that the model extracts. For example: Dosage: if the correct dosage is "200mg-400mg every 4-6 hours" and the model extracts only "200mg-400mg", the recall for dosage is 50%. Side Effects: if the correct side effects are "Stomach upset, nausea, dizziness, headache" and the model extracts "Stomach upset, nausea", the recall for side effects is 50%.
• F1-score metric: the harmonic mean of precision and recall, i.e., a balanced measure of the two. A higher F1-score indicates better performance.
• Accuracy: the overall percentage of correct extractions across all classes. As far as this metric is concerned, we also evaluate the class-level accuracy, i.e., the accuracy for each specific class separately.

4.1. Gold Standard Creation

In order to evaluate the system results, we created a gold standard (GS), manually annotating the following categories: i) molecule; ii) dosage; iii) drug interaction; iv) usage; v) side effect. For each of the aforementioned classes we define some guidelines and specifications for the annotation, as summarised in the following paragraphs.

Molecule. The category is used to identify the main ingredient(s) of the drug. In some cases, the bulking agent(s) may be reported together with the molecule(s). These are not included in the molecule class.

Dosage information. This class refers to the recommended dosage for drug administration. We annotate neither the treatment duration nor the maximum dosage in the dosage information.

For dosage information we distinguish between dosage for children and adults. We do not distinguish dosage for infants or elders (the former is annotated as dosage information for children, the latter as dosage information for adults, as reported below).

When the same dosage can be used for both adults and children, the general dosage information category is applied.

Example:
10 mg una volta al giorno negli adulti e nei bambini di età uguale o superiore ai 10 anni
(10 mg once a day in adults and children aged 10 years or older)

Furthermore, dosage information could be differentiated on the basis of age/weight. In such cases, unless dosages for adults and children are explicitly differentiated, we always use the general category dosage.

Example:
Adulti, anziani e bambini di età pari o superiore a 12 anni con un peso corporeo pari o superiore a 50 chilogrammi (kg): • da 1 a 2 g una volta al giorno a seconda della gravità e del tipo di infezione
(Adults, elderly, and children aged 12 years and older with a body weight of 50 kilograms (kg) or more: • 1 to 2 g once a day, depending on the severity and type of infection.)

Dosage for infants can be expressed through a co-reference to some other dosage, e.g., for adults or children, sometimes with a different time schedule, as in lo stesso dosaggio sopra descritto ma somministrato una volta ogni due giorni (the same dosage as described above, but administered once every two days). Unless the minimal span that conveys the information about the dosage is explicitly mentioned, we do not annotate these spans, as the information is context-dependent.

The treatment of specific diseases might require different dosages for the same drug. When they are reported in the leaflet, following a minimum span principle, we annotate all the dosages without any specification about the disease. Due to the aforementioned annotation choice, the annotation results will be a set of dosage information, as in the following example (annotated spans reported in bold face).

Example:
Aspergillosi: - 2 capsule una volta al giorno per un periodo di 2-5 mesi; (...) Candidosi: 1-2 capsule 1 volta al giorno per un periodo da 3 settimane a 7 mesi (...) Criptococcosi non meningea: 2 capsule una volta al giorno per un periodo dai 2 mesi ad 1 anno (...)
(Aspergillosis: - 2 capsules once a day for a period of 2-5 months; (...) Candidiasis: 1-2 capsules once a day for a period of 3 weeks to 7 months (...) Non-meningeal cryptococcosis: 2 capsules once a day for a period of 2 months to 1 year (...))

When the same dosage applies in more than one case, span duplicates may be present (e.g., 2 capsule una volta al giorno). In the final GS, these are removed so that only one span for each type is kept.

Some drugs must be administered according to a schedule that spans different time periods, with or without dosage variations. In such cases, we annotate only the initial recommended dosage.

In some cases, the posology section does not provide specific dosage information and instead includes a general recommendation to consult a doctor. In these instances, we consider the information to be missing and do not annotate the general statement.

Drug interaction. As for drug interactions, we annotate the names of molecules and drugs when they are available. In some cases, the information about drug interaction is reported as a general reference to the use of some drugs (e.g., medicinali per abbassare la pressione - medicines to lower blood pressure). In such instances, as we cannot identify the specific molecule or drug, we annotate the general reference. Information about drug interactions may also appear as a reference to certain types of relationships with other molecules, as in derivati della fenotiazina (phenothiazine derivatives). For our annotations, we omit additional information and select the minimal span, in the aforementioned example, fenotiazina (phenothiazine).

Similarly, when the information pertains to the drug class instead of reporting the molecule, e.g., lassativi (laxatives), we annotate the minimal span, even though in some cases the drug use is specified, e.g., medicinali usati per trattare la stipsi (medicines used to treat constipation).

We apply a hierarchical priority to identify and annotate drug interactions, as follows:

1. Molecule
2. Drug class
3. Drug use

The aforementioned hierarchy helps us identify the span to be annotated. When included, drug names are always annotated.

When the interaction information is reported with the specific pharmaceutical form (e.g., eritromicina iniettabile - injectable erythromycin), only the minimal possible span is annotated, i.e., eritromicina.

In some cases, examples of interacting molecules or drug names are provided alongside the drug class or use (e.g., medicinali usati per il trattamento dell'HIV AIDS, per esempio ketoconazolo e itraconazolo - medicines used for the treatment of HIV/AIDS, for example, ketoconazole and itraconazole). In these instances, we annotate both, as the list of drugs and molecules may not be exhaustive. If the list is exhaustive, we do not annotate the general reference to the drug use; we only annotate the drug molecules or names.

Interactions with some other molecules can be conditioned by the amount taken, e.g., cimetidina, preso in dosi giornaliere superiori a 800mg (cimetidine, taken in daily doses greater than 800 mg). In these cases too, the molecule name is the only span annotated.

Some interacting drugs are reported as the general drug class, together with a plain language explanation and a subclass specification, as in the following example:

diuretici (compresse per urinare in particolare quelli chiamati risparmiatori di potassio)
(diuretics (tablets for urination, particularly those called potassium-sparing))

As the molecule is not noted, we annotate both the general class and the subclass (both in bold face in the previous excerpt).

Additionally, food and beverages can also interact with drugs, e.g., pompelmo, alcol (grapefruit, alcohol). We opt not to include these substances within the drug interaction class, as we want to focus only on pharmaceutical drug interactions.

Drug interaction information is considered missing when there is only a general sentence stating that the use of any further drug should be reported.

Usage. With respect to usage, we consider the minimal possible span, which indicates the disease treated by the specific drug. Thus, for instance, in the sentence {drug_name} è usato nel trattamento della gotta ({drug_name} is used in the treatment of gout), we annotate only gotta (gout).

In other cases, some examples of usage may be reported, as in traumi (ad esempio causati dallo sport) (injuries (for example, those caused by sports)). As those cases are not representative enough of usages, we do not include them in the annotation, so in the previous excerpt we annotate just traumi (injuries).

Within the usage section, sometimes a plain-text description is reported together with the reference to the specific disease, e.g., meningite criptococcica - un'infezione micotica del cervello (...) (cryptococcal meningitis - a fungal infection of the brain (...)). We always annotate the specific term for the disease and discard the plain text description.

When the generic disease class is presented, e.g., infezioni cutanee (skin infections), followed by a non-exhaustive list of examples, we annotate just the generic use.

Side effects. This class indicates all the possible side effects caused by the drug consumption. In PILs, this type of information is generally grouped on the basis of the number of people affected by the side effects to identify different diffusion levels, e.g., very common side effects, very rare side effects. We do not differentiate among the diffusion levels and consider all the side effects as belonging to the same class side_effect. In some cases, side effects affecting subjects other than the person consuming the drug are reported. For instance, some drugs can affect the fetus, as in the following excerpt.

Example:
(...) Se assume Ricap durante le ultime fasi della gravidanza, il suo bambino potrebbe manifestare i seguenti sintomi: problemi a respirare, colorito bluastro o violaceo della pelle, convulsioni (...).
[(...) If you take Ricap during the later stages of pregnancy, your baby may experience the following symptoms: breathing problems, bluish or purplish skin discoloration, seizures (...)]

We do not annotate these secondary side effects, nor the ones derived from drug overdose.

When the side effect type is reported together with its symptoms, we include those within the class of side effects. For instance, in some cases a list of symptoms, difficoltà respiratoria, riduzione della pressione sanguigna (breathing difficulty, reduced blood pressure), is combined with the general side effect reazioni allergiche (allergic reactions). Each of them is annotated separately and included in the list of side effects.

Similarly, we annotate both the plain language side effect and the term, as in problemi del flusso della bile (colestasi) (bile flow problems (cholestasis)).

When the side effects are reported as worsening of an already existing disease, e.g., aumentata perdita di capelli (increased hair loss), we annotate the minimum possible span, i.e., perdita di capelli (hair loss).

For drugs containing more than one molecule, side effects are reported along with the side effects for each individual molecule. We annotate all of them.

Side effects can be reported with reference to some patient/disease type, e.g., Se è HIV positivo può mostrare effetti indesiderati (If you are HIV positive, you may experience side effects). In such cases, symptoms are annotated without any further specification.

If duplicates are present, they are either not annotated or removed in the post-processing phase, so that just one entry per symptom type is recorded in the GS.

Sometimes, side effects are grouped by indicating the general area (e.g., organ or functionality) affected, e.g., nervous system disorders. The information might be followed by a list of specific side effects. When this is the case, we discard the general information in favor of the most specific one.

It is worth stressing that other information may be presented in PILs, for instance Precautions for use. As we are not interested in this type of information, we do not annotate such sections.

Inter-Annotator Agreement. The annotation has been performed by three people with computational linguistic backgrounds and different levels of expertise. An initial inter-annotator agreement has been evaluated after the first draft of the guidelines was created. Borderline cases and issues have been collected by each of the annotators and subsequently discussed and solved. The guidelines have been updated accordingly and a second round of annotation has been performed in order to compute the final inter-annotator agreement.

The annotation round for evaluating the final inter-annotator agreement has been performed on a subset of 60 leaflets.

The results, calculated before the post-processing phase, show complete agreement on the molecule class among all the annotators, while for the remaining classes the agreement ranges from .61 for posology to .80 for side effects (Table 3).

Class              A1/A2   A1/A3   A2/A3   AVG
Molecule           1       1       1       1
Usage              .69     .67     .68     .68
Posology           .61     .62     .66     .63
Drug interaction   .66     .66     .65     .66
Side effects       .80     .76     .75     .78

Table 3: IAA for the GS

To assess the inter-annotator agreement (IAA) for the creation of the gold standard, we employed two different metrics: pairwise F1 score [15, 16] and token-level agreement percentage [17]. The pairwise F1 score was used to calculate the IAA for the "Molecule" and "Usage" labels, as the information contained in the text for these entities refers to unique and well-defined concepts. This metric provides a balanced measure of the precision and recall of the annotations, allowing us to quantify the level of agreement between annotators on the identification of these specific entities.

On the other hand, for the "Dosage", "Drug Interaction", and "Side Effect" classes, we opted to use the token-level agreement percentage as the IAA metric. This choice was motivated by the fact that these classes involve variable text spans, which can be more challenging to align between annotators. Before calculating the token-level agreement percentage, we performed preprocessing steps on the annotated portions, removing punctuation marks (such as - and • that indicate a list) and Italian stopwords from the spaCy Italian language model (https://spacy.io/models/it#it_core_news_lg). The token-level agreement percentage provides a more granular assessment of the consistency in the identification of the relevant text segments, which is crucial for the accurate extraction of these types of entities from the source documents.

GS Post-processing. To ensure high consistency among annotations and to remove additional information that does not meet the specified annotation criteria, we perform a post-processing step. During this phase, we review the GS, using recurring patterns and regular expressions to clean the data and correct errors. We also carry out manual cleaning to produce the final GS.

For instance, when applicable, we remove the drug name mentioned in the posology specification (e.g., one tablet of drug_name once a day) so that only the general information related to the molecule is retained.

The resulting evaluation dataset contains 657 annotated molecules, 8,617 drug interactions, 2,159 usage entries, and 36,748 side effects (Table 4).

Class              Tot. Entities   Unique Entities
Molecule           657             657
Usage              2159            2113
Posology           831             827
Drug interaction   8617            8458
Side effect        36748           30313
Total              49012           42368

Table 4: Annotated entities for each class

4.2. Results

The expected results should be presented as a list of entities for each of the classes of information about each drug. To obtain the result lists, we consider the annotated terms and their simplifications as separate entities, e.g., the span livelli aumentati di calcio nel sangue (ipercalcemia) (elevated levels of calcium in the blood (hypercalcemia)) is listed as two separate entities, livelli aumentati di calcio nel sangue and ipercalcemia. This choice aims at accounting for both entities as possible correct answers.

For instance, for the drug NATRILIX, the expected results are as follows:

• Usage: pressione sanguigna elevata, ipertensione arteriosa essenziale
• Molecule: indapamide
• Dosage: 1 compressa al giorno
• Side_effect: eruzioni cutanee, bassi livelli di potassio nel sangue, vomito, porpora ...
• Drug_interaction: litio, chinidina, idrochinidina, disopiramide (...)

For the drug Trevid, the correct answers would be:

• Usage: carenza di vitamina D
• Molecule: colecalciferolo
• Dosage: 3-4 gocce al giorno
• Side_effect: livelli aumentati di calcio nel sangue, ipercalcemia, livelli aumentati di calcio nelle urine, ipercalciuria, debolezza, astenia, reazioni allergiche, appetito ridotto (...)
• Drug_interaction: anticonvulsivanti, barbiturici, colestipolo, colestiramina, orlistat (...)

Since this is an information extraction task in a zero-shot setting based on PILs, it is expected that LLMs will be able to extract the exact terminology used in the different sections of the PILs and provide a list of terms. The performance will be evaluated based on the metrics described in Section 4. Potential limitations in accurately assessing the performance of LLMs may arise from: 1) the variability in the models' choice of terms to extract, and 2) the provision of terms and their simplifications as two entities. In the first case, forcing the LLMs to provide a more structured and less ambiguous output might help, as the gold standard currently does not account for a set of synonyms to handle variability in the output; in the second case, employing additional metrics might help.
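To make the metrics of Section 4 and the entity lists of Section 4.2 concrete, here is a minimal sketch of set-based scoring for one class of one leaflet. It is an illustration under stated assumptions, not the official scorer: the matching criterion is assumed to be exact string match after lowercasing and whitespace normalization (the text does not fix one), and the function names `normalize`, `split_simplification` and `prf` are hypothetical.

```python
import re


def normalize(entity):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", entity.strip().lower())


def split_simplification(span):
    """Split 'plain-language term (technical term)' into two separate entities."""
    match = re.fullmatch(r"(.+?)\s*\(([^()]+)\)\s*", span)
    if match:
        return [match.group(1).strip(), match.group(2).strip()]
    return [span.strip()]


def prf(predicted, gold):
    """Precision, recall and F1 over two lists of strings, compared as sets."""
    pred = {normalize(e) for e in predicted}
    ref = {normalize(e) for e in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Side-effect example from Section 4: two of the four gold items are extracted.
gold_side_effects = ["Stomach upset", "nausea", "dizziness", "headache"]
predicted_side_effects = ["stomach upset", "Nausea"]
p, r, f = prf(predicted_side_effects, gold_side_effects)
print(p, r, round(f, 3))

# A term with its simplification becomes two gold entities, as in Section 4.2.
pair = split_simplification("livelli aumentati di calcio nel sangue (ipercalcemia)")
print(pair)
```

Under this exact-match assumption, the side-effect example reproduces the 50% recall given in Section 4. Partial credit, such as the 50% recall in the dosage example where only "200mg-400mg" is extracted, would need token-level rather than string-level comparison.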
While the dataset has been carefully curated ing to the advancement of research and applications in and annotated, the underlying data may contain errors, the field of biomedical text mining and pharmacological inaccuracies, or other issues that are not explicitly ac- information extraction. knowledged by the original provider. Researchers and developers using the DIMMI dataset should be aware of this limitation and exercise caution when relying on the Acknowledgments information contained within the corpus, particularly for Luca Giordano has been supported by Borsa di Studio critical applications or decision-making processes. GARR "Orio Carlini" 2023/24 - Consortium GARR, the National Research and Education Network. 6. Ethical issues Ethical considerations are crucial when working with a dataset that contains sensitive information from PILs. References The DIMMI corpus, which is derived from the AIFA (Ital- [1] W. H. Shrank, J. Avorn, Educating patients about ian Medicines Agency) Database, must be handled with their medications: the potential and limitations of the utmost care and respect for individual privacy, data written drug information, Health affairs 26 (2007) protection, and the diversity of the target population. 731–740. Additionally, the use of the DIMMI corpus for the de- [2] P. Rodríguez, R. Azarola, S. Lorda, B. Cantalejo, velopment and evaluation of natural language processing A. Danet, et al., Quality improvement of health models must be guided by ethical principles that consider information included in drug information leaflets. the diversity of the target population. The models trained patient and health professional expectations, Aten- on this data should be designed and deployed in a way cion primaria 42 (2009) 22–27. that respects individual privacy, avoids potential mis- [3] M. Á. Piñero-López, P. Modamio, C. F. Lastra, E. L. 
use or discrimination, and ultimately benefits the public Mariño, Readability analysis of the package leaflets good, regardless of ethnicity or age. Careful considera- for biological medicines available on the internet tion should be given to the potential societal impact of between 2007 and 2013: an analytical longitudinal the applications built upon the DIMMI dataset, ensuring study, Journal of medical Internet research 18 (2016) that they are inclusive and equitable. e100. By upholding the ethical standards in the handling and [4] I. Segura-Bedmar, P. Martínez, Simplifying drug utilization of the DIMMI corpus, the research community package leaflets written in spanish by using word can ensure that the valuable pharmacological information embedding, Journal of biomedical semantics 8 contained in the PILs is leveraged responsibly and in a (2017) 1–9. manner that prioritizes the well-being of patients and [5] M. Yuan, P. Bao, J. Yuan, Y. Shen, Z. Chen, Y. Xie, the general public, while respecting the diversity of the J. Zhao, Y. Chen, L. Zhang, L. Shen, et al., Large target population. language models illuminate a progressive pathway to artificial healthcare assistant: A review, arXiv 7. Data license and copyright preprint arXiv:2311.01918 (2023). [6] A. B. Abacha, E. Agichtein, Y. Pinter, D. Demner- issues Fushman, Overview of the medical question an- swering task at trec 2017 liveqa., in: TREC, 2017, The DIMMI corpus has been created using the patient in- pp. 1–12. formation leaflets (PILs) from the AIFA (Italian Medicines [7] A. B. Abacha, Y. Mrabet, M. Sharp, T. R. Goodwin, Agency) Database. As reported in the Web site8 , the distri- S. E. Shooshan, D. Demner-Fushman, Bridging the bution license used by AIFA for these data is the Creative gap between consumers’ medication questions and Commons Attribution (CC-BY) license, version 4.0. 
This trusted answers, in: MEDINFO 2019: Health and license allows third parties to distribute, modify, adapt, Wellbeing e-Networks for All, IOS Press, 2019, pp. and use the data, even for commercial purposes, with the 25–29. sole requirement of providing attribution to the original [8] V. Nguyen, S. Karimi, M. Rybinski, Z. Xing, source. Medredqa for medical consumer question answer- ing: Dataset, tasks, and neural baselines, in: Pro- 8 https://www.aifa.gov.it/en/copyright ceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Con- ference of the Asia-Pacific Chapter of the Associa- tion for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 629–648. [9] M. Filannino, Ö. Uzuner, Advancing the state of the art in clinical natural language processing through shared tasks, Yearbook of medical informatics 27 (2018) 184–192. [10] R. Vaishya, A. Misra, A. Vaish, Chatgpt: Is this ver- sion good for healthcare and research?, Diabetes & Metabolic Syndrome: Clinical Research & Reviews 17 (2023) 102744. [11] P. Lee, S. Bubeck, J. Petro, Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine, New England Journal of Medicine 388 (2023) 1233–1239. [12] S. Gilbert, H. Harvey, T. Melvin, E. Vollebregt, P. Wicks, Large language model ai chatbots re- quire approval as medical devices, Nature Medicine 29 (2023) 2396–2398. [13] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Fran- cis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Ri- naldi, D. Scalena, CALAMITA: Challenge the Abili- ties of LAnguage Models in ITAlian, in: Proceed- ings of the 10th Italian Conference on Computa- tional Linguistics (CLiC-it 2024), Pisa, Italy, Decem- ber 4 - December 6, 2024, CEUR Workshop Proceed- ings, CEUR-WS.org, 2024. [14] L. Giordano, M. P. 
Di Buono, Large language models as drug information providers for pa- tients, in: Proceedings of the First Work- shop on Patient-Oriented Language Processing (CL4Health)@ LREC-COLING 2024, 2024, pp. 54– 63. [15] G. Hripcsak, A. S. Rothschild, Agreement, the f- measure, and reliability in information retrieval, Journal of the American medical informatics asso- ciation 12 (2005) 296–298. [16] L. Deleger, Q. Li, T. Lingren, M. Kaiser, K. Molnar, L. Stoutenborough, M. Kouril, K. Marsolo, I. Solti, et al., Building gold standard corpora for medical natural language processing tasks, in: AMIA An- nual Symposium Proceedings, volume 2012, Ameri- can Medical Informatics Association, 2012, p. 144. [17] C. Grouin, S. Rosset, P. Zweigenbaum, K. Fort, O. Galibert, L. Quintard, Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview, in: Proceedings of the 5th linguistic annotation workshop, 2011, pp. 92–100.
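
Note on the GS clean-up operations. The text mentions two operations that can be illustrated with regular expressions: removing the drug name from a posology string, and listing a term and its parenthesised simplification as two separate entities. The following is only a minimal sketch under stated assumptions. The function names, the handling of the Italian preposition "di", and the drug name "DrugX" are hypothetical and are not taken from the DIMMI pipeline, whose actual patterns are not given in the paper.

```python
import re


def strip_drug_name(posology: str, drug_name: str) -> str:
    """Remove a drug name (optionally preceded by 'di') from a posology string.

    Hypothetical helper: the real DIMMI patterns are not published here.
    """
    pattern = re.compile(
        r"\s*(?:\bdi\s+)?\b" + re.escape(drug_name) + r"\b",
        re.IGNORECASE,
    )
    cleaned = pattern.sub("", posology)
    # Collapse any double spaces left behind by the removal.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def split_parenthesised_span(span: str) -> list[str]:
    """Split 'text (alternative)' into two entities; otherwise return one."""
    match = re.match(r"^(.*?)\s*\(([^()]+)\)\s*$", span)
    if match is None:
        return [span.strip()]
    return [match.group(1).strip(), match.group(2).strip()]


if __name__ == "__main__":
    # "DrugX" is an invented placeholder, not a drug from the corpus.
    print(strip_drug_name("una compressa di DrugX una volta al giorno", "DrugX"))
    print(split_parenthesised_span(
        "livelli aumentati di calcio nel sangue (ipercalcemia)"
    ))
```

A purely pattern-based approach like this would miss inflected or abbreviated drug names, which is consistent with the manual cleaning pass that the paper reports in addition to the automatic one.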