
DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge

Raffaele Manna

Maria Pia di Buono

Luca Giordano

University of Naples "L'Orientale", via Duomo 219, 80139 Napoli, Italy


Patients' knowledge about drugs and medications is crucial, as it allows them to take medicines safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or drug web pages. DIMMI (Drug InforMation Mining in Italian) is a challenge that evaluates the proficiency of Large Language Models (LLMs) in extracting drug-specific information from PILs. The challenge seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, and to enhance drug information extraction and pharmacovigilance efforts. Participants are provided with a dataset of 600 Italian PILs, and the objective is to develop models capable of accurately answering specific questions related to drug dosage, usage, side effects, and drug-drug interactions. The challenge should be approached as an information extraction task, either in zero-shot mode, relying purely on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category, the GS contains the correct information extracted from the leaflets through manual annotation.

Keywords: Patient information leaflets, Information extraction, Large Language Models, Italian

1. Introduction and Motivation

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
Email: rafaele.manna@unior.it (R. Manna); mpdibuono@unior.it (M. P. di Buono); giordanoluca.uni@gmail.com (L. Giordano)
ORCID: 0009-0006-6285-8557 (R. Manna); 0009-0009-9372-3323 (M. P. di Buono); 0009-0002-3048-4408 (L. Giordano)

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Guideline on the Readability of the Labelling and Package Leaflet of Medicinal Products for Human Use - European Commission (2009). https://health.ec.europa.eu/document/download/d8612682-ad17-40e3-8130-23395ec80380_en

(...) due to the presence of model hallucinations, potentially causing medical malpractice [10], as any concealed inaccuracies in diagnoses and health advice could lead to severe outcomes [11]. For these reasons, in the evolving landscape of Artificial Intelligence (AI) applications in medicine, considerations have been raised regarding the regulatory approval of LLMs as medical devices, highlighting the ethical and legal dimensions associated with deploying such technologies in healthcare settings [12].

To delve deeper into this topic, within the CALAMITA campaign [13], we introduce DIMMI (Drug InforMation Mining in Italian), a challenge centered on evaluating the proficiency of LLMs in extracting drug-specific information from Italian PILs. In this way, the task aims to contribute to the development of AI systems for enhancing drug information extraction and pharmacovigilance efforts, specifically for the Italian language.

2. DIMMI

As DIMMI seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, participants are provided with the complete leaflets for each drug, and the objective is to develop models capable of accurately answering specific questions related to a drug, such as its dosage, usage, etc. The challenge should be approached as an information extraction task, either in zero-shot mode, relying purely on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category (e.g., dosage, usage, side effects, drug-drug interactions), the GS contains the correct information extracted from the leaflets, manually annotated according to the categories described in Section 4.1.

3. Data description

3.1. Origin of data

The challenge dataset is derived from the D-LeafIT Corpus [14], available on GitHub (footnote 2), made up of 1819 Italian drug package leaflets. The corpus was created by extracting the PILs available on the website of the Italian Medicines Agency (Agenzia Italiana del Farmaco - AIFA) (footnote 3), among which 1439 refer to generic drugs and 380 to class A drugs.

2 https://github.com/unior-nlp-research-group/D-LeafIT
3 https://www.aifa.gov.it/en/home

In the original corpus, the generic drug leaflets amount to 6,154,007 tokens while the class A leaflets amount to 1,650,879 tokens, for a total of 7,804,886 tokens. The DIMMI dataset represents a subset of 600 entries randomly selected from the D-LeafIT corpus.

It is worth stressing that the information is extracted from PDF files and converted into text, which means that some errors and typos may occur. Furthermore, the original D-LeafIT presents some data noise, e.g., the presence of paratext and wrong encoding from PDF files. To fix these issues, we perform a cleaning procedure as a pre-processing phase to obtain the final dataset. The procedure is mainly automatic and based on recurrent patterns, so some of the aforementioned issues may still be present. The dataset pre-processing phase can be summarized in two main steps:

• Correcting the separation of the leaflets by identifying regular patterns which indicate the beginning/end of a single leaflet.
• Removing additional information about the issue date, the pharmaceutical company, and the marketing authorization.
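The two cleaning steps listed above can be illustrated with a minimal sketch. The paper does not report the actual patterns used, so the marker phrases below (a leaflet title line, an "updated on" line, and a marketing authorization holder line) are hypothetical stand-ins, as are the function names `split_leaflets` and `strip_metadata`.

```python
import re

# Hypothetical markers: the real recurrent patterns are not given in the paper.
LEAFLET_START = re.compile(r"(?=Foglio illustrativo: informazioni per)")
METADATA_LINES = re.compile(
    r"^(?:Questo foglio illustrativo è stato aggiornato"
    r"|Titolare dell'autorizzazione all'immissione in commercio).*$",
    re.MULTILINE,
)


def split_leaflets(raw_text):
    """Step 1: cut the raw text wherever a new leaflet appears to begin."""
    # Splitting on a zero-width lookahead requires Python 3.7 or later.
    parts = LEAFLET_START.split(raw_text)
    return [part.strip() for part in parts if part.strip()]


def strip_metadata(leaflet):
    """Step 2: drop lines about issue date, company and authorization."""
    cleaned = METADATA_LINES.sub("", leaflet)
    # Collapse the blank lines left behind by the removed metadata.
    return re.sub(r"\n{2,}", "\n", cleaned).strip()


raw = (
    "Foglio illustrativo: informazioni per il paziente\n"
    "AAA 10 mg compresse\n"
    "Titolare dell'autorizzazione all'immissione in commercio: Acme\n"
    "Questo foglio illustrativo è stato aggiornato il 01/2020\n"
    "Foglio illustrativo: informazioni per l'utilizzatore\n"
    "BBB 5 mg capsule\n"
)
leaflets = [strip_metadata(part) for part in split_leaflets(raw)]
```

Because such a procedure relies on surface patterns only, leaflets whose layout deviates from the assumed markers would pass through uncleaned, which is consistent with the remark that some noise may remain.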

Additionally, we notice several cases of duplicate entries, which arise for different reasons, as described below:

1. Same drug name, same dosage form, same ingredient amount, different issue dates → These cases indicate that the leaflet has been updated and all the versions are recorded in the AIFA repository. In such cases, on the basis of their ID, the less recent leaflet has been removed.
2. Same drug name, same dosage form, different ingredient amount → These cases may or may not present the same information in the leaflets. We do not remove these entries, even when they present the same information about the classes we are interested in.
3. Same drug name, different dosage form, same ingredient amount → These duplicates are not removed, as dosage information can differ on the basis of the drug form.
4. Same drug name, same dosage form, same ingredient amount, different pharmaceutical company → These duplicates are removed and just one entry is kept. We usually prefer keeping the one whose name reports 'DOC generici'. If this is not possible, we keep the first occurrence.
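The four duplicate rules can be read as a two-pass filter, sketched below. The record fields (`leaflet_id`, `name`, `dosage_form`, `amount`, `company`) are hypothetical, and the sketch assumes that a larger ID means a more recent leaflet, which the paper implies but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaflet:
    leaflet_id: int   # assumption: a larger ID means a more recent version
    name: str
    dosage_form: str
    amount: str
    company: str


def deduplicate(leaflets):
    # Rule 1: for identical drug, form, amount and company, keep the newest.
    latest = {}
    for leaflet in leaflets:
        key = (leaflet.name, leaflet.dosage_form, leaflet.amount, leaflet.company)
        if key not in latest or leaflet.leaflet_id > latest[key].leaflet_id:
            latest[key] = leaflet

    # Rule 4: across companies keep one entry, preferring 'DOC generici',
    # otherwise the first occurrence (dicts preserve insertion order).
    kept = {}
    for leaflet in latest.values():
        key = (leaflet.name, leaflet.dosage_form, leaflet.amount)
        current = kept.get(key)
        if current is None:
            kept[key] = leaflet
        elif "DOC generici" in leaflet.company and "DOC generici" not in current.company:
            kept[key] = leaflet

    # Rules 2 and 3: a different amount or dosage form yields a different key,
    # so those entries are never merged.
    return list(kept.values())


records = [
    Leaflet(1, "X", "capsule", "10 mg", "Acme"),
    Leaflet(2, "X", "capsule", "10 mg", "Acme"),          # newer version of 1
    Leaflet(3, "X", "capsule", "10 mg", "DOC generici"),  # other company
    Leaflet(4, "X", "capsule", "20 mg", "Acme"),          # other amount
    Leaflet(5, "X", "compresse", "10 mg", "Acme"),        # other dosage form
]
kept_ids = sorted(leaflet.leaflet_id for leaflet in deduplicate(records))
```

In this toy input, entries 1 and 2 collapse under rule 1, entry 3 replaces entry 2 under rule 4, and entries 4 and 5 survive because they differ in amount or form.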

3.2. Data format

The whole leaflets are provided in the dataset, so that the context is available. Additionally, we provide the drug name for each leaflet. The final dataset, released (footnote 4) in .tsv (tab-separated values) format, contains four columns. For each entry we present an ID, an ID_LOC which indicates the ID location in the original corpus, the drug name (without any reference to the ingredient amount, the dosage form, and the pharmaceutical company), and the leaflet text (Table 1).

4 https://huggingface.co/datasets/RafaMann/DIMMI

Table 1
Drug_Name    Text
BOTAM        BOTAM 0,4 mg capsule (...) Tamsulosina cloridrato Medicinale equivalente (...).

Participants in the DIMMI challenge are required to use LLMs to extract the following information from the PILs text: 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect'. This information must be provided as output in a structured format such as TSV or JSON, with reference to each ID and drug name contained in the evaluation dataset. The information extracted for each ID and drug name with reference to 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect' must be represented in the form of a list of strings (see Section 4.2).

The evaluation dataset for the DIMMI corpus contains columns for the following entity types: 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect'. For each instance (drug leaflet) in the DIMMI corpus, the entity-specific columns are populated with lists of strings representing the entities of the corresponding type.

The 'Molecule' column will contain a list of the unique molecular entities mentioned in the text, while the 'Usage' column will include a list of the specific uses or indications for the drug. The 'Dosage' column will hold a list of the textual spans describing the dosage, administration, or regimen information. The 'Drug Interaction' column will contain a list of the potential interactions with other drugs, and the 'Side Effect' column will include a list of the adverse effects associated with the drug.

3.3. Prompting

For each drug in the dataset, we evaluate the results from two types of zero-shot prompts in Italian, i.e., specific task-focused prompts and structured prompts.

The former type is composed of five questions, one for each of the information types we want to extract, as reported below (footnote 5):

1. Qual è la molecola di {drug_name}? (What is the molecule of {drug_name}?) - to extract the molecule
2. Per cosa si usa {drug_name}? (What is {drug_name} used for?) - to extract the usage
3. Qual è la posologia raccomandata per {drug_name}? (What is the recommended dosage for {drug_name}?) - to extract the dosage
4. Quali sono gli effetti collaterali di {drug_name}? (What are the potential side effects of taking {drug_name}?) - to extract side effects
5. Con quali medicinali interagisce {drug_name}? (What are the drug interactions of {drug_name}?) - to extract the interaction with other drugs

5 It is worth stressing that in the prompt examples {drug_name} is not a masked word; it is a placeholder for one of the entries from the column drug_name in the DIMMI dataset.

The latter type of prompt aims at extracting all the relevant information with a specific instruction that helps the model understand the expected output structure and facilitates extraction, as follows:

• Fornisci le seguenti informazioni su {drug_name}:
  Molecola:
  Uso:
  Posologia:
  Effetti collaterali:
  Interazione con altri medicinali:
  (Provide the following information about {drug_name}:
  Molecule:
  Usage:
  Dosage:
  Side Effects:
  Drug interaction:)

3.4. Dataset statistics

As mentioned before, the final dataset is composed of 600 unique PILs in Italian. The documents in the DIMMI dataset exhibit a wide range of lengths (Table 2), with the shortest document containing 363 tokens and the longest extending to 11,730 tokens. Token counts coincide with word counts, since each word is treated as a single token in this analysis. On average, each document contains approximately 2,520 words, with a standard deviation of 848 words, indicating moderate variability in document length. The distribution of document lengths is further characterized by the 25th, 50th (median), and 75th percentiles, which are 1,960, 2,448, and 2,980.75 words, respectively.

In total, the corpus contains 1,511,724 words (equivalently, tokens). The lexical diversity of the corpus is reflected in the 58,901 unique tokens identified, resulting in a type-token ratio (TTR) of 0.0390. This relatively low TTR suggests a high degree of repetition within the text, which is typical for technical and regulatory documents such as drug package leaflets. Importantly, there are no empty documents in the corpus, ensuring that all entries contribute meaningful content to the dataset.

Table 2: Dataset statistics (num_documents, mean_length, min_length, max_length, std_length, percentiles, total_words, mean_words_per_doc, total_tokens, min_tokens, max_tokens, unique_tokens, type_token_ratio).
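As a rough illustration of how the statistics in this section relate to one another, the sketch below computes them for a list of documents. It assumes plain whitespace tokenisation with no lowercasing, which matches the statement that each word counts as one token but is otherwise a guess. The function name `corpus_stats` is hypothetical.

```python
import statistics


def corpus_stats(documents):
    """Length and lexical-diversity statistics over whitespace tokens."""
    # Assumption: tokens are whitespace-separated words, case preserved.
    token_lists = [doc.split() for doc in documents]
    lengths = [len(tokens) for tokens in token_lists]
    total_tokens = sum(lengths)
    unique_tokens = len({tok for tokens in token_lists for tok in tokens})
    return {
        "num_documents": len(documents),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "mean_length": statistics.mean(lengths),
        "total_tokens": total_tokens,
        "unique_tokens": unique_tokens,
        # Type-token ratio: distinct tokens divided by all tokens.
        "type_token_ratio": unique_tokens / total_tokens,
    }


toy = corpus_stats(["Il farmaco è un farmaco", "Prendere una compressa"])

# The figures reported in the text reproduce the stated TTR:
reported_ttr = f"{58_901 / 1_511_724:.4f}"  # "0.0390"
```

The reported totals are also consistent with the stated mean, since 1,511,724 words over 600 documents gives roughly 2,520 words per document.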

4. Evaluation metrics

We will evaluate the results using accuracy, precision, recall and F1-score, using a gold standard as benchmark (see Section 4.1). The details for each metric are provided below:

• Precision metric: For example: Dosage: If the model extracts "200mg-400mg every 4-6 hours" and this is correct, the precision for dosage is 100%; Side Effects: If the model extracts "Stomach upset, nausea" and this is partially correct (missing other side effects), the precision for side effects might be 50% (depending on how many side effects are correctly identified);
• Recall metric: For example: Dosage: If the correct dosage is "200mg-400mg every 4-6 hours" and the model extracts only "200mg-400mg," the recall for dosage is 50%. Side Effects: If the correct side effects are "Stomach upset, nausea, dizziness, headache" and the model extracts "Stomach upset, nausea," the recall for side effects is 50%.
• F1-score metric: A balanced measure of precision and recall. A higher F1-score indicates better performance.
• Accuracy: The overall percentage of correct extractions across all classes. As far as this metric is concerned, we also evaluate the class-level accuracy, i.e., the accuracy for each specific class separately.

In order to evaluate the system results, we created a gold standard (GS), manually annotating the following categories: i) molecule; ii) dosage; iii) drug interaction; iv) usage; v) side effect. For each of the aforementioned classes we define some guidelines and specifications for the annotation, as summarised in the following paragraphs.

Molecule. The category is used to identify the main ingredient(s) of the drug. In some cases, the bulking agent(s) may be reported together with the molecule(s). These are not included in the molecule class.

Dosage information. This class refers to the recommended dosage for drug administration. We annotate neither the treatment duration nor the maximum dosage in the dosage information.

For dosage information we distinguish between dosage for children and adults. We do not distinguish dosage for infants or elders (the former is annotated as dosage information for children, the latter as dosage information for adults, as reported below).

When the same dosage can be used for both adults and children, the general dosage information category is applied.

Example:
10 mg una volta al giorno negli adulti e nei bambini di età uguale o superiore ai 10 anni
(10 mg once a day in adults and children aged 10 years or older)

Furthermore, dosage information could be differentiated on the basis of age/weight. In such cases, unless dosages for adults and children are explicitly differentiated, we always use the general category dosage.

Example:
Adulti, anziani e bambini di età pari o superiore a 12 anni con un peso corporeo pari o superiore a 50 chilogrammi (kg): • da 1 a 2 g una volta al giorno a seconda della gravità e del tipo di infezione
(Adults, elderly, and children aged 12 years and older with a body weight of 50 kilograms (kg) or more: • 1 to 2 g once a day, depending on the severity and type of infection.)

Dosage for infants can be expressed through a co-reference to some other dosage, e.g., for adults or children, sometimes with a different time schedule, as in lo stesso dosaggio sopra descritto ma somministrato una volta ogni due giorni (The same dosage as described above, but administered once every two days.). Unless the dosage is explicitly mentioned, we do not annotate these spans, as the information is context-dependent.

The treatment of specific diseases might require different dosages for the same drug. When they are reported in the leaflet, following a minimum span principle, we annotate all the dosages without any specification about the disease. Due to the aforementioned annotation choice, the annotation result will be a set of dosage information, as in the following example (annotated spans reported in bold face).

Example:
Aspergillosi: - 2 capsule una volta al giorno per un periodo di 2-5 mesi; (...) Candidosi: 1-2 capsule 1 volta al giorno per un periodo da 3 settimane a 7 mesi (...) Criptococcosi non meningea: 2 capsule una volta al giorno per un periodo dai 2 mesi ad 1 anno (...)

When the same dosage can be applied in more than one case, span duplicates may be present (e.g., 2 capsule una volta al giorno). In the final GS, these are removed so that only one span for each type is kept.

Some drugs must be administered according to a schedule that spans different time periods, with or without dosage variations. In such cases, we annotate only the initial recommended dosage.

In some cases, the posology section does not provide specific dosage information and instead includes a general recommendation to consult a doctor. In these instances, we consider the information to be missing and do not annotate the general statement.

Drug interaction. As for drug interactions, we annotate the names of molecules and drugs when they are available. In some cases, the information about drug interaction is reported as a general reference to the use of some drugs (e.g., medicinali per abbassare la pressione - medicines to lower blood pressure). In such instances, as we cannot identify the specific molecule or drug, we annotate the general reference. Information about drug interactions may also appear as a reference to certain types of relationships with other molecules, as in derivati della fenotiazina (phenothiazine derivatives). For our annotations, we omit additional information and select the minimal span, in the aforementioned example, fenotiazina (phenothiazine).

Similarly, when the information pertains to the drug class instead of reporting the molecule, e.g., lassativi (laxatives), we annotate the minimal span, even though in some cases the drug use is specified, e.g., medicinali usati per trattare la stipsi (medicines used to treat constipation).

We apply a hierarchical priority to identify and annotate the minimal span that conveys the information about drug interactions, as follows:

1. Molecule
2. Drug class
3. Drug use

The aforementioned hierarchy helps us identify the span to be annotated. When included, drug names are always annotated.

When the interaction information is reported with the specific pharmaceutical form (e.g., eritromicina iniettabile), only the minimal possible span is annotated, i.e., eritromicina.

In some cases, examples of interacting molecules or drug names are provided alongside the drug class or use (e.g., medicinali usati per il trattamento dell'HIV AIDS, per esempio ketoconazolo e itraconazolo - medicines used for the treatment of HIV/AIDS, for example, ketoconazole and itraconazole). In these instances, we annotate both, as the list of drugs and molecules may not be exhaustive. If the list is exhaustive, we do not annotate the general reference to the drug use; we only annotate the drug molecules or names.

Interactions with some other molecules can be conditioned by the amount taken, e.g., cimetidina, preso in dosi giornaliere superiori a 800mg (cimetidine, taken in daily doses greater than 800 mg). In these cases too, the molecule name is the only span annotated.

Some interacting drugs are reported as the general drug class, together with a plain language explanation and a subclass specification, as in the following example:
diuretici (compresse per urinare in particolare quelli chiamati risparmiatori di potassio)
(diuretics (tablets for urination, particularly those called potassium-sparing))
As the molecule is not noted, we annotate both the general class and the subclass (both in bold face in the previous excerpt).

Additionally, food and beverages can also interact with drugs, e.g., pompelmo, alcol (grapefruit, alcohol). We opt not to include these substances within the drug interaction class, as we want to focus only on pharmaceutical drug interactions.

Drug interaction information is considered missing when there is only a general statement that the use of any further drug should be reported.

Usage. With respect to usage, we consider the minimal possible span, which indicates the disease treated by the specific drug. Thus, for instance, in the sentence {drug_name} è usato nel trattamento della gotta ({drug_name} is used in the treatment of gout), we annotate only gotta (gout).

In other cases, some examples of usage may be reported, as in traumi (ad esempio causati dallo sport) (injuries (for example, those caused by sports)). As those cases are not representative enough of usages, we do not include them in the annotation, so in the previous excerpt we annotate just traumi (injuries).

Within the usage section, sometimes a plain-language description is reported together with a reference to the specific disease, e.g., meningite criptococcica - un'infezione micotica del cervello (...). We always annotate the specific term for the disease and discard the plain text description.

When the generic disease class is presented, e.g., infezioni cutanee (skin infections), followed by a non-exhaustive list of examples, we annotate just the generic use.

Side effects. This class indicates all the possible side effects caused by the drug consumption. In PILs, this type of information is generally grouped on the basis of the number of people affected by the side effects to identify different diffusion levels, e.g., very common side effects, very rare side effects. We do not differentiate among the diffusion levels and consider all the side effects as belonging to the same class side_effect. In some cases, side effects affecting subjects other than the person consuming the drug are reported. For instance, some drugs can affect the fetus, as in the following excerpt.

Example:
(...) Se assume Ricap durante le ultime fasi della gravidanza, il suo bambino potrebbe manifestare i seguenti sintomi: problemi a respirare, colorito bluastro o violaceo della pelle, convulsioni (...).
[(...) If you take Ricap during the later stages of pregnancy, your baby may experience the following symptoms: breathing problems, bluish or purplish skin discoloration, seizures (...)]

We do not annotate these secondary side effects and the ones derived from drug overdose.

When the side effect type is reported together with its symptoms, we include those within the class of side effects. For instance, in some cases a list of symptoms, difficoltà respiratoria, riduzione della pressione sanguigna, is combined with the general side effect reazioni allergiche. Each of them is annotated separately and included in the list of side effects.

Similarly, we annotate both the plain language side effect and the term, as in problemi del flusso della bile (colestasi) (bile flow problems (cholestasis)).

When the side effects are reported as worsening of an already existing disease, e.g., aumentata perdita di capelli, we annotate the minimum possible span, i.e., perdita di capelli.

For drugs containing more than one molecule, side effects are reported along with the side effects for each individual molecule. We annotate all of them.

Side effects can be reported with reference to some patient/disease type, e.g., Se è HIV positivo può mostrare effetti indesiderati (If you are HIV positive, you may experience side effects). In such cases, symptoms are annotated without any further specification.

If duplicates are present, they are either not annotated or removed in the post-processing phase, so that just one entry per symptom type is recorded in the GS.

Sometimes, side effects are grouped by indicating the general area (e.g., organ or functionality) affected, e.g., nervous system disorders. The information might be followed by a list of specific side effects. When this is the case, we discard the general information in favor of the most specific one.

It is worth stressing that other information may be presented in PILs, for instance Precautions for use. As we are not interested in this type of information, we do not annotate such sections.

Inter-Annotator Agreement. The annotation was performed by three people with computational linguistic backgrounds and different levels of expertise. An initial inter-annotator agreement was evaluated after the first draft of the guidelines had been created. Borderline cases and issues were collected by each of the annotators and subsequently discussed and solved. The guidelines were updated accordingly and a second round of annotation was performed in order to compute the final inter-annotator agreement.

The annotation round for evaluating the final inter-annotator agreement was performed on a subset of 60 leaflets.

The results, calculated before the post-processing phase, show complete agreement on the molecule class among all the annotators, while for the remaining classes the agreement ranges from .61 (posology) to .80 (side effects) (Table 3).

Table 3
Class    Molecule  Usage  Posology  Drug interaction  Side effects
A1/A2    1         .69    .61       .66               .80
A1/A3    1         .67    .62       .66               .76
A2/A3    1         .68    .66       .65               .75
AVG      1         .68    .63       .66               .78

To assess the inter-annotator agreement (IAA) for the creation of the gold standard, we employed two different metrics: pairwise F1 score [15, 16] and token-level agreement percentage [17]. The pairwise F1 score was used to calculate the IAA for the "Molecule" and "Usage" labels, as the information contained in the text for these entities refers to unique and well-defined concepts. This metric provides a balanced measure of the precision and recall of the annotations, allowing us to quantify the level of agreement between annotators on the identification of these specific entities.

On the other hand, for the "Dosage", "Drug Interaction", and "Side Effect" classes, we opted to use the token-level agreement percentage as the IAA metric. This choice was motivated by the fact that these classes involve variable text spans, which can be more challenging to align between annotators. Before calculating the token-level agreement percentage, we performed preprocessing steps on the annotated portions, removing punctuation marks (such as - and • that indicate a list) and Italian stopwords from the spaCy Italian language model (footnote 6). The token-level agreement percentage provides a more granular assessment of the consistency in the identification of the relevant text segments, which is crucial for the accurate extraction of these types of entities from the source documents.

GS Post-processing. To ensure high consistency among annotations and to remove additional information that does not meet the specified annotation criteria, we perform a post-processing step. During this phase, we review the GS, using recurring patterns and regular expressions to clean the data and correct errors. We also carry out manual cleaning to produce the final GS.

For instance, when applicable, we remove the drug name mentioned in the posology specification (e.g., one tablet of drug_name once a day) so that only the general information related to the molecule is retained.

The resulting evaluation dataset contains XXX annotated molecules, XXX drug interactions, XXX usage information, and XXX side-effects (Table 4).

Table 4
Class      Tot. Entities  Unique Entities
Molecule   657            657
Usage      2159           2113
Posology   831            827
Total      49012          42368
Side effect and Drug interaction: the per-class counts are interleaved in the source (368764187; 308341538).

The expected results should be presented as a list of entities for each of the classes of information about each drug. To obtain the result lists, we consider the annotated terms and their simplifications as unique entities, e.g., the span livelli aumentati di calcio nel sangue (ipercalcemia) (elevated levels of calcium in the blood (hypercalcemia)) is listed as two separate entities, namely livelli aumentati di calcio nel sangue and ipercalcemia. This choice aims at accounting for both entities as possible correct answers.

For instance, for the drug NATRILIX, the expected results are as follows:

• Usage: pressione sanguigna elevata, ipertensione arteriosa essenziale
• Molecule: indapamide
• Dosage: 1 compressa al giorno
• Side_effect: eruzioni cutanee, bassi livelli di potassio nel sangue, vomito, porpora ...
• Drug_interaction: litio, chinidina, idrochinidina, disopiramide (...)

For the drug Trevid, the correct answers would be:

• Usage: carenza di vitamina D
• Molecule: colecalciferolo
• Dosage: 3-4 gocce al giorno
• Side_effect: livelli aumentati di calcio nel sangue, ipercalcemia, livelli aumentati di calcio nelle urine, ipercalciuria, debolezza, astenia, reazioni allergiche, appetito ridotto (...)
• Drug_interaction: anticonvulsivanti, barbiturici, colestipolo, colestiramina, orlistat (...)

Since this is an information extraction task in a zero-shot setting based on PILs, it is expected that LLMs will be able to extract the exact terminology used in the different sections of the PILs and provide a list of terms. The performance will be evaluated based on the metrics described in Section 4. Potential limitations in accurately assessing the performance of LLMs may arise from: 1) the variability in the models' choice of terms to extract, and 2) the provision of terms and their simplifications as two entities. In the first case, forcing the LLMs to provide a more structured and less ambiguous output might help, as currently the gold standard does not account for a set of synonyms to handle variability in the output; the second case could be addressed by employing additional metrics.
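The side-effect examples given for precision and recall in Section 4 can be reproduced with a small set-based sketch over lists of strings. This assumes exact matching after lowercasing and trimming, which is a simplification: the paper's dosage example (50% recall for a partially extracted span) suggests that partial matches may receive credit, and that case is not modelled here. The function names are hypothetical.

```python
def normalise(items):
    """Lowercase and trim each entity; drop empty strings."""
    return {item.strip().lower() for item in items if item.strip()}


def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two entity lists."""
    pred, ref = normalise(predicted), normalise(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["Stomach upset", "nausea", "dizziness", "headache"]
predicted = ["Stomach upset", "nausea"]
precision, recall, f1 = precision_recall_f1(predicted, gold)
# precision is 1.0 and recall is 0.5, matching the recall example in Section 4
```

Under exact matching, a model that answers with a synonym or a paraphrase of a gold entity scores zero for it, which is one concrete form of the first limitation mentioned above.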

5. Limitations

One important limitation of the DIMMI dataset is the disclaimer provided by the Italian Medicines Agency (AIFA) regarding the content available on their website in section A. Disclaimer (footnote 7). AIFA states that all the information and services offered on their website are provided "as is" and "with all faults". The Italian Medicines Agency, therefore, does not provide any kind of warranty, either explicit or implied, regarding the content, including, without limitation, the legality, ownership, suitability, or fitness for particular purposes or uses.

6 https://spacy.io/models/it#it_core_news_lg
7 https://www.aifa.gov.it/en/copyright

This disclaimer from the data source raises concerns about the reliability and quality of the patient information leaflets (PILs) that were used to construct the DIMMI corpus. While the dataset has been carefully curated and annotated, the underlying data may contain errors, inaccuracies, or other issues that are not explicitly acknowledged by the original provider. Researchers and developers using the DIMMI dataset should be aware of this limitation and exercise caution when relying on the information contained within the corpus, particularly for critical applications or decision-making processes.

6. Ethical issues

Ethical considerations are crucial when working with a dataset that contains sensitive information from PILs. The DIMMI corpus, which is derived from the AIFA (Italian Medicines Agency) Database, must be handled with the utmost care and respect for individual privacy, data protection, and the diversity of the target population.

Additionally, the use of the DIMMI corpus for the development and evaluation of natural language processing models must be guided by ethical principles that consider the diversity of the target population. The models trained on this data should be designed and deployed in a way that respects individual privacy, avoids potential misuse or discrimination, and ultimately benefits the public good, regardless of ethnicity or age. Careful consideration should be given to the potential societal impact of the applications built upon the DIMMI dataset, ensuring that they are inclusive and equitable.

By upholding ethical standards in the handling and utilization of the DIMMI corpus, the research community can ensure that the valuable pharmacological information contained in the PILs is leveraged responsibly and in a manner that prioritizes the well-being of patients and the general public, while respecting the diversity of the target population.

7. Data license and copyright issues

The DIMMI corpus has been created using the patient information leaflets (PILs) from the AIFA (Italian Medicines Agency) Database. As reported on the website (footnote 8), the distribution license used by AIFA for these data is the Creative Commons Attribution (CC-BY) license, version 4.0. This license allows third parties to distribute, modify, adapt, and use the data, even for commercial purposes, with the sole requirement of providing attribution to the original source.

8 https://www.aifa.gov.it/en/copyright

By making the DIMMI corpus available under the CC-BY 4.0 license, the dataset can be freely accessed, utilized, and built upon by the scientific community, contributing to the advancement of research and applications in the field of biomedical text mining and pharmacological information extraction.

Acknowledgments

Luca Giordano has been supported by Borsa di Studio GARR "Orio Carlini" 2023/24 - Consortium GARR, the National Research and Education Network.

References

[1] W. H. Shrank, Avorn, Educating patients about their medications: the potential and limitations of written drug information, Health Affairs 26 (2007) 731-740.

[2] Rodríguez, Azarola, Lorda, Cantalejo, Danet, et al., Quality improvement of health information included in drug information leaflets. Patient and health professional expectations, Atención Primaria 42 (2009) 22-27.

[3] Á. Piñero-López, P. Modamio, C. F. Lastra, E. L. Mariño, Readability analysis of the package leaflets for biological medicines available on the internet between 2007 and 2013: an analytical longitudinal study, Journal of Medical Internet Research 18 (2016) e100.

[4] Segura-Bedmar, Martínez, Simplifying drug package leaflets written in Spanish by using word embedding, Journal of Biomedical Semantics 8 (2017) 1-9.

[5] Yuan, Bao, Yuan, Shen, Chen, Xie, Zhao, Chen, Zhang, Shen, et al., Large language models illuminate a progressive pathway to artificial healthcare assistant: A review, arXiv preprint arXiv:2311.01918 (2023).

[6] A. B. Abacha, E. Agichtein, Pinter, D. Demner-Fushman, Overview of the medical question answering task at TREC 2017 LiveQA, in: TREC, 2017, pp. 1-12.

[7] A. B. Abacha, Mrabet, Sharp, T. R. Goodwin, S. E. Shooshan, D. Demner-Fushman, Bridging the gap between consumers' medication questions and trusted answers, in: MEDINFO 2019: Health and Wellbeing e-Networks for All, IOS Press, 2019, pp. 25-29.

[8] Nguyen, Karimi, Rybinski, Xing, MedRedQA for medical consumer question answering: Dataset, tasks, and neural baselines, in: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 629-648.

[9] Filannino, Ö. Uzuner, Advancing the state of the art in clinical natural language processing through shared tasks, Yearbook of Medical Informatics 27 (2018) 184-192.

[10] Vaishya, Misra, Vaish, ChatGPT: Is this version good for healthcare and research?, Diabetes & Metabolic Syndrome: Clinical Research & Reviews 17 (2023) 102744.

[11] Lee, Bubeck, Petro, Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine, New England Journal of Medicine 388 (2023) 1233-1239.

[12] Gilbert, Harvey, Melvin, E. Vollebregt, Wicks, Large language model AI chatbots require approval as medical devices, Nature Medicine 29 (2023) 2396-2398.

[13] Attanasio, Basile, Borazio, Croce, Francis, Gili, E. Musacchio, Nissim, Patti, Rinaldi, Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.

[14] Giordano, M. P. Di Buono, Large language models as drug information providers for patients, in: Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, 2024, pp. 54-63.

[15] Hripcsak, A. S. Rothschild, Agreement, the F-measure, and reliability in information retrieval, Journal of the American Medical Informatics Association 12 (2005) 296-298.

[16] Deleger, Li, Lingren, Kaiser, Molnar, Stoutenborough, Kouril, Marsolo, Solti, et al., Building gold standard corpora for medical natural language processing tasks, in: AMIA Annual Symposium Proceedings, volume 2012, American Medical Informatics Association, 2012, p. 144.

[17] Grouin, Rosset, Zweigenbaum, Fort, Galibert, L. Quintard, Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview, in: Proceedings of the 5th Linguistic Annotation Workshop, 2011, pp. 92-100.