DIMMI - Drug InforMation Mining in Italian: A CALAMITA
                                Challenge
                                Raffaele Manna1,†, Maria Pia di Buono1,*,† and Luca Giordano1,†
                                1 University of Naples "L’Orientale", via Duomo 219, 80139 Napoli, Italy


                                                 Abstract
                                                 Patients’ knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge
                                                 frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. DIMMI
                                                 (Drug InforMation Mining in Italian) is a challenge aiming at evaluating the proficiency of Large Language Models in extracting
                                                 drug-specific information from PILs. The challenge seeks to advance the understanding of LLM effectiveness in processing complex
                                                 medical information in Italian, and to enhance drug information extraction and pharmacovigilance efforts. Participants are
                                                 provided with a dataset of 600 Italian PILs and the objective is to develop models capable of accurately answering specific
                                                 questions related to drug dosage, usage, side effects, and drug-drug interactions. The challenge should be approached as an
                                                 information extraction task through a zero-shot mode, purely based on the model's pre-existing knowledge and understanding,
                                                 or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the
                                                 models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set
                                                 of answers against which participant submissions can be evaluated. For each drug and each information category, the GS
                                                 contains the correct information extracted from the leaflets through a manual annotation.

                                                 Keywords
                                                 Patient information leaflets, Information extraction, Large Language Models, Italian


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
Email: raffaele.manna@unior.it (R. Manna); mpdibuono@unior.it (M. P. d. Buono); giordanoluca.uni@gmail.com (L. Giordano)
ORCID: 0009-0006-6285-8557 (R. Manna); 0009-0009-9372-3323 (M. P. d. Buono); 0009-0002-3048-4408 (L. Giordano)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction and Motivation

Patients’ knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. Nevertheless, this information has been described as often inconsistent, incomplete, and difficult for patients to read and understand [1]. Although in 2009 the European Commission issued guidelines (Guideline on the Readability of the Labelling and Package Leaflet of Medicinal Products for Human Use, European Commission (2009), https://health.ec.europa.eu/document/download/d8612682-ad17-40e3-8130-23395ec80380_en) recommending the publication of patient information leaflets with accessible and understandable information for patients, several scholars [2, 3, 4] report no improvement in the readability of such documents. Thus, educating patients about their medications seems to be a challenging task due to the linguistic nature of written drug information, which includes a high presence of specialized terms used to describe adverse drug reactions, diseases and other medical concepts that are not easy to understand.

Recently, there has been a growing interest in the utilization of Large Language Models (LLMs) within the medical field to improve various aspects of healthcare, including medical education and clinical decision-making support [5]. Several specialized medical LLMs have been developed through novel pre-training methodologies or enhancements of existing models. Moreover, several evaluation campaigns have been undertaken to evaluate the efficacy of natural language processing models in facilitating knowledge retrieval for clinicians and patients alike. Examples of such campaigns are 1) the Medical Question Answering Task at TREC-2017 LiveQA [6] and subsequent studies [7], which led to two datasets, LiveQA and MedicationQA; 2) the tasks on Medical Consumer Question Answering proposed by Nguyen et al. [8] based on their dataset MedRedQA. Both campaigns have contributed significantly to bridging the gap between consumers’ medication questions and trusted answers, and, more generally, to the development of resources tailored to healthcare information retrieval. For a thorough survey of evaluation campaigns on clinical natural language processing, refer to Filannino and Uzuner [9].

The application of LLMs as patient assistants to support drug knowledge and ease drug administration seems very attractive; however, it needs to be evaluated carefully due to the presence of model hallucinations, potentially causing medical malpractice [10], as any concealed inaccuracies in diagnoses and health advice could lead to severe outcomes [11]. For these reasons, in the evolving landscape of Artificial Intelligence (AI) applications in
medicine, considerations have been raised regarding the regulatory approval of LLMs as medical devices, highlighting the ethical and legal dimensions associated with deploying such technologies in healthcare settings [12].

To delve deeper into this topic, within the CALAMITA campaign [13], we introduce DIMMI (Drug InforMation Mining in Italian), a challenge centered on evaluating the proficiency of LLMs in extracting drug-specific information from Italian PILs. In this way, the task aims at contributing to the development of AI systems for enhancing drug information extraction and pharmacovigilance efforts, specifically for the Italian language.

2. DIMMI

As DIMMI seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, participants are provided with the complete leaflets for each drug and the objective is to develop models capable of accurately answering specific questions related to a drug, such as its dosage, usage, etc. The challenge should be approached as an information extraction task through a zero-shot mode, purely based on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category (e.g., dosage, usage, side effects, drug-drug interactions), the GS contains the correct information extracted from the leaflets, manually annotated according to the categories described in Section 4.1.

3. Data description

3.1. Origin of data

The challenge dataset is derived from the D-LeafIT Corpus [14], available on GitHub (https://github.com/unior-nlp-research-group/D-LeafIT), made up of 1819 Italian drug package leaflets. The corpus has been created by extracting PILs available on the Italian Agency for Medications (Agenzia Italiana del Farmaco - AIFA) website (https://www.aifa.gov.it/en/home), among which 1439 refer to generic drugs and 380 to class A drugs. In the original corpus, the generic drug leaflets amount to 6,154,007 tokens while the class A leaflets amount to 1,650,879 tokens, for a total of 7,804,886 tokens. The DIMMI dataset represents a subset of 600 entries randomly selected from the D-LeafIT corpus.

It is worth stressing that the information is extracted from PDF files and converted into text, which means that some errors and typos may occur. Furthermore, the original D-LeafIT presents some data noise, e.g., the presence of paratext and wrong encoding from PDF files. To fix these issues, we perform a cleaning procedure as a pre-processing phase to obtain the final dataset. The procedure is mainly automatic and based on recurrent patterns, so some of the aforementioned issues could still be present. The dataset pre-processing phase can be summarized in two main steps:

• Correcting the separation of the leaflets by identifying regular patterns which indicate the beginning/end of a unique leaflet.
• Removing additional information about the issue date, the pharmaceutical company, and the marketing authorization.

Additionally, we notice the presence of several cases of duplicate entries, due to different reasons, as described below:

1. Same drug name, same dosage form, same ingredient amount, different issue dates → These cases indicate that the leaflet has been updated and all the versions are recorded in the AIFA repository. In such cases, on the basis of their ID, the less recent leaflet has been removed.
2. Same drug name, same dosage form, different ingredient amount → These cases may or may not present the same information leaflets. We do not remove the duplicate entries, even when they present the same information about the classes we are interested in.
3. Same drug name, different dosage form, same ingredient amount → These duplicates are not removed as dosage information can be differentiated on the basis of the drug form.
4. Same drug name, same dosage form, same ingredient amount, different pharmaceutical company → These duplicates are removed and just one entry is kept. We usually prefer keeping the one reporting ’DOC generici’ in the name. If this is not possible, we keep the first occurrence.

3.2. Data format

The whole leaflets are provided in the dataset, so that the context is available. Additionally, we provide the drug name for each leaflet. The final dataset, released (https://huggingface.co/datasets/RafaMann/DIMMI) in .tsv (tab-separated values) format, contains four columns.
        ID     ID_Loc       Drug_Name         Text
        119    119_276      BOTAM             BOTAM 0,4 mg capsule (...) Tamsulosina cloridrato Medicinale equivalente (...).
Table 1
Example of a DIMMI entry
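
A record such as the one in Table 1 can be read with Python's standard csv module. The following is a minimal sketch, assuming the released file uses the column headers shown in Table 1 and no field quoting. The in-memory sample string stands in for the real file, and read_dimmi is a hypothetical helper name, not part of the DIMMI release.

```python
import csv
import io

# One record copied from Table 1; "(...)" marks text elided in the paper.
# The header names are an assumption based on the column labels in Table 1.
SAMPLE_TSV = (
    "ID\tID_Loc\tDrug_Name\tText\n"
    "119\t119_276\tBOTAM\t"
    "BOTAM 0,4 mg capsule (...) Tamsulosina cloridrato Medicinale equivalente (...).\n"
)


def read_dimmi(handle):
    """Parse a DIMMI-style TSV stream into a list of dicts keyed by column name."""
    # QUOTE_NONE keeps any quotation marks inside the leaflet text as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


records = read_dimmi(io.StringIO(SAMPLE_TSV))
print(records[0]["ID"], records[0]["Drug_Name"])
```

To read an actual file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to a local copy of the dataset.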


Each entry has an ID, an ID_Loc which indicates the ID location in the original corpus, the drug name (without any reference to the ingredient amount, the dosage form, and the pharmaceutical company), and the leaflet text (Table 1).

Participants in the DIMMI challenge are required to use LLMs to extract the following information from the PIL text: ’Molecule’, ’Usage’, ’Dosage’, ’Drug Interaction’, and ’Side Effect’. This information must be provided as output in a structured format such as TSV or JSON, with reference to each ID and drug name contained in the evaluation dataset. The information extracted for each ID and drug name with reference to ’Molecule’, ’Usage’, ’Dosage’, ’Drug Interaction’, and ’Side Effect’ must be represented in the form of a list of strings (see Section 4.2).

The evaluation dataset for the DIMMI corpus contains columns for the following entity types: ’Molecule’, ’Usage’, ’Dosage’, ’Drug Interaction’, and ’Side Effect’. For each instance (drug leaflet) in the DIMMI corpus, these entity-specific columns are populated with a list of strings, representing the annotated entities of the corresponding type.

The ’Molecule’ column will contain a list of the unique molecular entities mentioned in the text, while the ’Usage’ column will include a list of the specific uses or indications for the drug. The ’Dosage’ column will hold a list of the textual spans describing the dosage, administration, or regimen information. The ’Drug Interaction’ column will contain a list of the potential interactions with other drugs, and the ’Side Effect’ column will include a list of the adverse effects associated with the drug.

3.3. Prompting

For each drug in the dataset, we evaluate the results from two types of zero-shot prompts in Italian, i.e., specific task-focused prompts and structured prompts.

The former type is composed of five questions, one for each information type we want to extract, as reported below (see footnote 5):

1. Qual è la molecola di {drug_name}? (What is the molecule of {drug_name}?) - to extract the molecule
2. Per cosa si usa {drug_name}? (What is {drug_name} used for?) - to extract the usage
3. Qual è la posologia raccomandata per {drug_name}? (What is the recommended dosage for {drug_name}?) - to extract the dosage
4. Quali sono gli effetti collaterali di {drug_name}? (What are the potential side effects of taking {drug_name}?) - to extract side effects
5. Con quali medicinali interagisce {drug_name}? (What are the drug interactions of {drug_name}?) - to extract the interaction with other drugs

The latter type of prompt aims at extracting all the relevant information with a specific instruction that helps the model understand the expected output structure and facilitates extraction, as follows:

• Fornisci le seguenti informazioni su {drug_name}:
  Molecola:
  Uso:
  Posologia:
  Effetti collaterali:
  Interazioni con altri medicinali:
  (Provide the following information about {drug_name}:
  Molecule:
  Usage:
  Dosage:
  Side Effects:
  Drug interaction:)

3.4. Dataset statistics

As mentioned before, the final dataset is composed of 600 unique PILs in Italian, providing a comprehensive dataset for the challenge. The documents in the DIMMI dataset exhibit a wide range of lengths (Table 2), with the shortest document containing 363 tokens and the longest extending to 11,730 tokens. This range in token count directly corresponds to the word count, indicating that each word is treated as a single token in this analysis. On average, each document contains approximately 2,520 words, with a standard deviation of 848 words, indicating moderate variability in document length. The distribution of document lengths is further characterized by the 25th, 50th (median), and 75th percentiles, which are 1,960, 2,448, and 2,980.75 words, respectively.
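
The length figures above can be recomputed along the following lines. This is a minimal sketch, assuming that tokens are whitespace-separated words (as the paragraph suggests), that percentiles use linear interpolation, and that the standard deviation is the sample one. The tooling actually used for the paper is not specified, and the name length_stats and the toy input are illustrative only.

```python
import statistics


def length_stats(texts):
    """Summarize document lengths, counting whitespace-separated words as tokens."""
    lengths = [len(text.split()) for text in texts]
    # "inclusive" interpolates linearly between observed values (assumed method).
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "num_documents": len(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths),  # sample standard deviation
        "percentiles": {0.25: q25, 0.5: q50, 0.75: q75},
    }


# Toy stand-ins for leaflet texts (2, 4, 6 and 8 words long).
toy_leaflets = ["a b", "a b c d", "a b c d e f", "a b c d e f g h"]
stats = length_stats(toy_leaflets)
print(stats)
```

Applied to the Text column of the dataset, the same function would yield the kind of summary reported in Table 2, provided the assumptions above match the original analysis.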
5 It is worth stressing that in the prompt examples {drug_name} is not a masked word; it is a placeholder for one of the entries from the column drug_name in the DIMMI dataset.

In total, the corpus contains 1,511,724 words and tokens. The lexical diversity of the corpus is reflected in the
58,901 unique tokens identified, resulting in a type-token ratio (TTR) of 0.0390. This relatively low TTR suggests a high degree of repetition within the text, which is typical for technical and regulatory documents such as drug package leaflets. Importantly, there are no empty documents in the corpus, ensuring that all entries contribute meaningful content to the dataset.

    DIMMI Statistics
    num_documents         600
    mean_length           2519.54
    min_length            363
    max_length            11730
    std_length            848.41
    percentiles           .25: 1960, .5: 2448, .75: 2980.75
    total_words           1511724
    mean_words_per_doc    2519.54
    total_tokens          1511724
    min_tokens            363
    max_tokens            11730
    unique_tokens         58901
    type_token_ratio      .039
Table 2
DIMMI statistics

4. Evaluation metrics

We will evaluate the results using accuracy, precision, recall and F1-score, using a gold standard as benchmark (see Section 4.1).
The details for each metric are provided below:

• Precision metric: The share of extracted items that are correct. For example: Dosage: If the model extracts "200mg-400mg every 4-6 hours" and this is correct, the precision for dosage is 100%; Side Effects: If the model extracts "Stomach upset, nausea" and this is only partially correct, the precision for side effects might be 50% (depending on how many of the extracted side effects are correct).
• Recall metric: The share of gold standard items that the model extracts. For example: Dosage: If the correct dosage is "200mg-400mg every 4-6 hours" and the model extracts only "200mg-400mg," the recall for dosage is 50%. Side Effects: If the correct side effects are "Stomach upset, nausea, dizziness, headache" and the model extracts "Stomach upset, nausea," the recall for side effects is 50%.
• F1-score metric: A balanced measure of precision and recall. A higher F1-score indicates better performance.
• Accuracy: The overall percentage of correct extractions across all classes. As far as this metric is concerned, we also evaluate the class-level accuracy, i.e., the accuracy for each specific class separately.

4.1. Gold Standard Creation

In order to evaluate the system results, we created a gold standard (GS), manually annotating the following categories: i) molecule; ii) dosage; iii) drug interaction; iv) usage; v) side effect. For each of the aforementioned classes we define some guidelines and specifications for the annotation, as summarised in the following paragraphs.

Molecule The category is used to identify the main ingredient(s) of the drug. In some cases, the bulking agent(s) may be reported together with the molecule(s). These are not included in the molecule class.

Dosage information This class refers to the recommended dosage for drug administration. We annotate neither the treatment duration nor the maximum dosage in the dosage information.
For dosage information we distinguish between dosage for children and adults. We do not distinguish dosage for infants or elders (the former is annotated as dosage information for children, the latter as dosage information for adults, as reported below).
When the same dosage can be used for both adults and children, the general dosage information category is applied.

Example:
10 mg una volta al giorno negli adulti e nei bambini di età uguale o superiore ai 10 anni
(10 mg once a day in adults and children aged 10 years or older)

Furthermore, dosage information could be differentiated on the basis of age/weight. In such cases, unless dosages for adults and children are explicitly differentiated, we always use the general category dosage.

Example:
Adulti, anziani e bambini di età pari o superiore a 12 anni con un peso corporeo pari o superiore a 50 chilogrammi (kg): • da 1 a 2 g una volta al giorno a seconda della gravità e del tipo di infezione
(Adults, elderly, and children aged 12 years and older with a body weight of 50 kilograms (kg) or more: • 1 to 2 g once a day, depending on the severity and type of infection.)

Dosage for infants can be expressed through a co-reference to some other dosage, e.g., for adults or children, sometimes with a different time schedule, as in lo stesso dosaggio sopra descritto ma somministrato una volta ogni due giorni (The same dosage as described
above, but administered once every two days.). Unless the dosage is explicitly mentioned, we do not annotate these spans, as the information is context-dependent.
The treatment of specific diseases might require different dosages for the same drug. When they are reported in the leaflet, following a minimum span principle, we annotate all the dosages without any specification about the disease. Due to the aforementioned annotation choice, the annotation results will be a set of dosage information, as in the following example (annotated spans reported in bold face).

Example:
Aspergillosi: - 2 capsule una volta al giorno per un periodo di 2-5 mesi; (...) Candidosi: 1-2 capsule 1 volta al giorno per un periodo da 3 settimane a 7 mesi (...) Criptococcosi non meningea: 2 capsule una volta al giorno per un periodo dai 2 mesi ad 1 anno (...)

When the same dosage can be applied in more than one case, span duplicates may be present (e.g., 2 capsule una volta al giorno). In the final GS, these are removed so that only one span for each type is kept.
Some drugs must be administered according to a schedule that spans different time periods, with or without dosage variations. In such cases, we annotate only the initial recommended dosage.

In some cases, the posology section does not provide specific dosage information and instead includes a general recommendation to consult a doctor. In these instances, we consider the information to be missing and do not annotate the general statement.

Drug interaction As for drug interactions, we annotate the names of molecules and drugs when they are available. In some cases, the information about drug interaction is reported as a general reference to the use of some drugs (e.g., medicinali per abbassare la pressione - medicines to lower blood pressure). In such instances, as we cannot identify the specific molecule or drug, we annotate the general reference. Information about drug interactions may also appear as a reference to certain types of relationships with other molecules, as in derivati della fenotiazina (phenothiazine derivatives). For our annotations, we omit additional information and select the minimal span, in the aforementioned example, fenotiazina (phenothiazine).
Similarly, when the information pertains to the drug class instead of reporting the molecule, e.g., lassativi (laxatives), we annotate the minimal span, even though in some cases the drug use is specified, e.g., medicinali usati per trattare la stipsi (medicines used to treat constipation).
We apply a hierarchical priority to identify and annotate the minimal span that conveys the information about drug interactions, as follows:

1. Molecule
2. Drug class
3. Drug use

The aforementioned hierarchy helps us identify the span to be annotated. When included, drug names are always annotated.
When the interaction information is reported with the specific pharmaceutical form (e.g., eritromicina iniettabile), only the minimal possible span is annotated, i.e., eritromicina.
In some cases, examples of interacting molecules or drug names are provided alongside the drug class or use (e.g., medicinali usati per il trattamento dell’HIV AIDS, per esempio ketoconazolo e itraconazolo - medicines used for the treatment of HIV/AIDS, for example, ketoconazole and itraconazole). In these instances, we annotate both, as the list of drugs and molecules may not be exhaustive. If the list is exhaustive, we do not annotate the general reference to the drug use; we only annotate the drug molecules or names.
Interactions with some other molecules can be conditioned by the amount taken, e.g., cimetidina, preso in dosi giornaliere superiori a 800mg (cimetidine, taken in daily doses greater than 800 mg). In these cases too, the molecule name is the only span annotated.
Some interacting drugs are reported as the general drug class, together with a plain language explanation and a subclass specification, as in the following example:
diuretici (compresse per urinare in particolare quelli chiamati risparmiatori di potassio) (diuretics (tablets for urination, particularly those called potassium-sparing))
As the molecule is not noted, we annotate both the general class and the subclass (both in bold face in the previous excerpt).
Additionally, food and beverages can also interact with drugs, e.g., pompelmo, alcol (grapefruit, alcohol). We opt not to include these substances within the drug interaction class, as we want to focus only on pharmaceutical drug interactions.
Drug interaction information is considered missing when there is only a general statement that the use of any further drug should be reported.

Usage With respect to usage, we consider the minimal possible span, which indicates the disease treated by the specific drug. Thus, for instance, in the sentence {drug_name} è usato nel trattamento della gotta ({drug_name} is used in the treatment of gout), we annotate only gotta (gout).
In other cases, some examples of usage may be reported as in traumi (ad esempio causati dallo sport) (injuries (for example, those caused by sports)). As those cases are not
representative enough of usages, we do not include them in the annotation, so in the previous excerpt we annotate just traumi (injuries).

Within the usage section, a plain language description is sometimes reported together with the reference to the specific disease, e.g., meningite criptococcica - un’infezione micotica del cervello (...) (cryptococcal meningitis - a fungal infection of the brain). We always annotate the specific term for the disease and discard the plain language description.

When the generic disease class is presented, e.g., infezioni cutanee (skin infections), followed by a non-exhaustive list of examples, we annotate just the generic use.

Side effects This class indicates all the possible side effects caused by the drug consumption. In PILs, this type of information is generally grouped on the basis of the number of people affected by the side effects to identify different diffusion levels, e.g., very common side effects, very rare side effects. We do not differentiate among the diffusion levels and consider all the side effects as belonging to the same class side_effect. In some cases, side effects affecting subjects other than the person consuming the drug are reported. For instance, some drugs can affect the fetus, as in the following excerpt.

Example:
(...) Se assume Ricap durante le ultime fasi della gravidanza, il suo bambino potrebbe manifestare i seguenti sintomi: problemi a respirare, colorito bluastro o violaceo della pelle, convulsioni (...).
[(...) If you take Ricap during the later stages of pregnancy, your baby may experience the following symptoms: breathing problems, bluish or purplish skin discoloration, seizures (...)]

We do not annotate these secondary side effects, nor the ones derived from drug overdose.

When the side effect type is reported together with its symptoms, we include those within the class of side effects. For instance, in some cases a list of symptoms, difficoltà respiratoria, riduzione della pressione sanguigna (breathing difficulty, reduced blood pressure), is combined with the general side effect reazioni allergiche (allergic reactions). Each of them is annotated separately and included in the list of side effects.

Similarly, we annotate both the plain language side effect and the term, as in problemi del flusso della bile (colestasi) (bile flow problems (cholestasis)).

When the side effects are reported as the worsening of an already existing disease, e.g., aumentata perdita di capelli (increased hair loss), we annotate the minimum possible span, i.e., perdita di capelli (hair loss).

For drugs containing more than one molecule, side effects are reported along with the side effects for each individual molecule. We annotate all of them.

Side effects can be reported with reference to some patient/disease type, e.g., Se è HIV positivo può mostrare effetti indesiderati (If you are HIV positive, you may experience side effects). In such cases, symptoms are annotated without any further specification.

If duplicates are present, they are either not annotated or removed in the post-processing phase, so that just one entry per symptom type is recorded in the GS.

Sometimes, side effects are grouped by indicating the general area (e.g., organ or functionality) affected, e.g., nervous system disorders. The information might be followed by a list of specific side effects. When this is the case, we discard the general information in favor of the most specific one.

It is worth stressing that other information may be presented in PILs, for instance Precautions for use. As we are not interested in this type of information, we do not annotate such sections.

Inter-Annotator Agreement The annotation has been performed by three people with computational linguistic backgrounds and different levels of expertise. An initial inter-annotator agreement was evaluated after the first draft of the guidelines had been created. Borderline cases and issues were collected by each of the annotators and subsequently discussed and solved. The guidelines were updated accordingly and a second round of annotation was performed in order to compute the final inter-annotator agreement. The annotation round for evaluating the final inter-annotator agreement was performed on a subset of 60 leaflets.

The results, calculated before the post-processing phase, show complete agreement on the molecule class among all the annotators, while for the remaining classes the pairwise agreement ranges from .61 for posology to .80 for side effects (Table 3).

Class             A1/A2  A1/A3  A2/A3  AVG
Molecule          1      1      1      1
Usage             .69    .67    .68    .68
Posology          .61    .62    .66    .63
Drug interaction  .66    .66    .65    .66
Side effects      .80    .76    .75    .78

Table 3
IAA for the GS

To assess the inter-annotator agreement (IAA) for the creation of the gold standard, we employed two different metrics: pairwise F1 score [15, 16] and token-level agreement percentage [17]. The pairwise F1 score was used to calculate the IAA for the "Molecule" and "Usage" labels, as the information contained in the text for these entities refers to unique and well-defined concepts. This metric provides a balanced measure of the precision and recall of the annotations, allowing us to quantify the level of
agreement between annotators on the identification of these specific entities.

On the other hand, for the "Dosage", "Drug Interaction", and "Side Effect" classes, we opted to use the token-level agreement percentage as the IAA metric. This choice was motivated by the fact that these classes involve variable text spans, which can be more challenging to align between annotators. Before calculating the token-level agreement percentage, we performed preprocessing steps on the annotated portions, removing punctuation marks (such as - and • that indicate a list) and the Italian stopwords from the spaCy Italian language model (footnote 6). The token-level agreement percentage provides a more granular assessment of the consistency in the identification of the relevant text segments, which is crucial for the accurate extraction of these types of entities from the source documents.

GS Post-processing To ensure high consistency among annotations and to remove additional information that does not meet the specified annotation criteria, we perform a post-processing step. During this phase, we review the GS, using recurring patterns and regular expressions to clean the data and correct errors. We also carry out manual cleaning to produce the final GS.

For instance, when applicable, we remove the drug name mentioned in the posology specification (e.g., one tablet of drug_name once a day) so that only the general information related to the molecule is retained.

The number of annotated entities per class in the resulting evaluation dataset is reported in Table 4.

Class             Tot. Entities  Unique Entities
Molecule          657            657
Usage             2159           2113
Posology          831            827
Drug interaction  8617           8458
Side effect       36748          30313
Total             49012          42368

Table 4
Annotated entities for each class

4.2. Results
The expected results should be presented as a list of entities for each of the classes of information about each drug. To obtain the result lists, we consider the annotated terms and their simplifications as separate entities, e.g., the span livelli aumentati di calcio nel sangue (ipercalcemia) (elevated levels of calcium in the blood (hypercalcemia)) is listed as two separate entities, livelli aumentati di calcio nel sangue and ipercalcemia. This choice aims at accounting for both entities as possible correct answers.

For instance, for the drug NATRILIX, the expected results are as follows:

• Usage: pressione sanguigna elevata, ipertensione arteriosa essenziale
• Molecule: indapamide
• Dosage: 1 compressa al giorno
• Side_effect: eruzioni cutanee, bassi livelli di potassio nel sangue, vomito, porpora ...
• Drug_interaction: litio, chinidina, idrochinidina, disopiramide (...)

For the drug Trevid, the correct answers would be:

• Usage: carenza di vitamina D
• Molecule: colecalciferolo
• Dosage: 3-4 gocce al giorno
• Side_effect: livelli aumentati di calcio nel sangue, ipercalcemia, livelli aumentati di calcio nelle urine, ipercalciuria, debolezza, astenia, reazioni allergiche, appetito ridotto (...)
• Drug_interaction: anticonvulsivanti, barbiturici, colestipolo, colestiramina, orlistat (...)

Since this is an information extraction task in a zero-shot setting based on PILs, it is expected that LLMs will be able to extract the exact terminology used in the different sections of the PILs and provide a list of terms. The performance will be evaluated based on the metrics described in Section 4. Potential limitations in accurately assessing the performance of LLMs may arise from: 1) the variability in the models’ choice of terms to extract, and 2) the provision of terms and their simplifications as two entities. For the first case, forcing the LLMs to provide a more structured and less ambiguous output might help, as currently the gold standard does not account for a set of synonyms to handle variability in the output; for the second case, additional metrics could be employed.

5. Limitations
One important limitation of the DIMMI dataset is the disclaimer provided by the Italian Medicines Agency (AIFA) regarding the content available on their website, in section A. Disclaimer (footnote 7). AIFA states that all the information and services offered on their website are provided "as is" and "with all faults". The Italian Medicines Agency, therefore, does not provide any kind of warranty, either explicit or implied, regarding the content, including, without limitation, the legality, ownership, suitability, or fitness for particular purposes or uses.

6 https://spacy.io/models/it#it_core_news_lg
7 https://www.aifa.gov.it/en/copyright
This disclaimer from the data source raises concerns about the reliability and quality of the patient information leaflets (PILs) that were used to construct the DIMMI corpus. While the dataset has been carefully curated and annotated, the underlying data may contain errors, inaccuracies, or other issues that are not explicitly acknowledged by the original provider. Researchers and developers using the DIMMI dataset should be aware of this limitation and exercise caution when relying on the information contained within the corpus, particularly for critical applications or decision-making processes.

6. Ethical issues
Ethical considerations are crucial when working with a dataset that contains sensitive information from PILs. The DIMMI corpus, which is derived from the AIFA (Italian Medicines Agency) Database, must be handled with the utmost care and respect for individual privacy, data protection, and the diversity of the target population.

Additionally, the use of the DIMMI corpus for the development and evaluation of natural language processing models must be guided by ethical principles that consider the diversity of the target population. The models trained on this data should be designed and deployed in a way that respects individual privacy, avoids potential misuse or discrimination, and ultimately benefits the public good, regardless of ethnicity or age. Careful consideration should be given to the potential societal impact of the applications built upon the DIMMI dataset, ensuring that they are inclusive and equitable.

By upholding ethical standards in the handling and utilization of the DIMMI corpus, the research community can ensure that the valuable pharmacological information contained in the PILs is leveraged responsibly and in a manner that prioritizes the well-being of patients and the general public, while respecting the diversity of the target population.

7. Data license and copyright issues
The DIMMI corpus has been created using the patient information leaflets (PILs) from the AIFA (Italian Medicines Agency) Database. As reported on the website (footnote 8), the distribution license used by AIFA for these data is the Creative Commons Attribution (CC-BY) license, version 4.0. This license allows third parties to distribute, modify, adapt, and use the data, even for commercial purposes, with the sole requirement of providing attribution to the original source.

By making the DIMMI corpus available under the CC-BY 4.0 license, the dataset can be freely accessed, utilized, and built upon by the scientific community, contributing to the advancement of research and applications in the field of biomedical text mining and pharmacological information extraction.

8 https://www.aifa.gov.it/en/copyright

Acknowledgments
Luca Giordano has been supported by Borsa di Studio GARR "Orio Carlini" 2023/24 - Consortium GARR, the National Research and Education Network.

References
[1] W. H. Shrank, J. Avorn, Educating patients about their medications: the potential and limitations of written drug information, Health Affairs 26 (2007) 731–740.
[2] P. Rodríguez, R. Azarola, S. Lorda, B. Cantalejo, A. Danet, et al., Quality improvement of health information included in drug information leaflets. Patient and health professional expectations, Atención Primaria 42 (2009) 22–27.
[3] M. Á. Piñero-López, P. Modamio, C. F. Lastra, E. L. Mariño, Readability analysis of the package leaflets for biological medicines available on the internet between 2007 and 2013: an analytical longitudinal study, Journal of Medical Internet Research 18 (2016) e100.
[4] I. Segura-Bedmar, P. Martínez, Simplifying drug package leaflets written in Spanish by using word embedding, Journal of Biomedical Semantics 8 (2017) 1–9.
[5] M. Yuan, P. Bao, J. Yuan, Y. Shen, Z. Chen, Y. Xie, J. Zhao, Y. Chen, L. Zhang, L. Shen, et al., Large language models illuminate a progressive pathway to artificial healthcare assistant: A review, arXiv preprint arXiv:2311.01918 (2023).
[6] A. B. Abacha, E. Agichtein, Y. Pinter, D. Demner-Fushman, Overview of the medical question answering task at TREC 2017 LiveQA, in: TREC, 2017, pp. 1–12.
[7] A. B. Abacha, Y. Mrabet, M. Sharp, T. R. Goodwin, S. E. Shooshan, D. Demner-Fushman, Bridging the gap between consumers’ medication questions and trusted answers, in: MEDINFO 2019: Health and Wellbeing e-Networks for All, IOS Press, 2019, pp. 25–29.
[8] V. Nguyen, S. Karimi, M. Rybinski, Z. Xing, MedRedQA for medical consumer question answering: Dataset, tasks, and neural baselines, in: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 629–648.
[9] M. Filannino, Ö. Uzuner, Advancing the state of the art in clinical natural language processing through shared tasks, Yearbook of Medical Informatics 27 (2018) 184–192.
[10] R. Vaishya, A. Misra, A. Vaish, ChatGPT: Is this version good for healthcare and research?, Diabetes & Metabolic Syndrome: Clinical Research & Reviews 17 (2023) 102744.
[11] P. Lee, S. Bubeck, J. Petro, Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine, New England Journal of Medicine 388 (2023) 1233–1239.
[12] S. Gilbert, H. Harvey, T. Melvin, E. Vollebregt, P. Wicks, Large language model AI chatbots require approval as medical devices, Nature Medicine 29 (2023) 2396–2398.
[13] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[14] L. Giordano, M. P. Di Buono, Large language models as drug information providers for patients, in: Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024, 2024, pp. 54–63.
[15] G. Hripcsak, A. S. Rothschild, Agreement, the F-measure, and reliability in information retrieval, Journal of the American Medical Informatics Association 12 (2005) 296–298.
[16] L. Deleger, Q. Li, T. Lingren, M. Kaiser, K. Molnar, L. Stoutenborough, M. Kouril, K. Marsolo, I. Solti, et al., Building gold standard corpora for medical natural language processing tasks, in: AMIA Annual Symposium Proceedings, volume 2012, American Medical Informatics Association, 2012, p. 144.
[17] C. Grouin, S. Rosset, P. Zweigenbaum, K. Fort, O. Galibert, L. Quintard, Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview, in: Proceedings of the 5th Linguistic Annotation Workshop, 2011, pp. 92–100.
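
As an illustration of the two IAA metrics described in the Inter-Annotator Agreement paragraph (pairwise F1 for "Molecule" and "Usage", token-level agreement percentage for the variable-span classes), the following is a minimal sketch and not the implementation used for DIMMI. It rests on several assumptions: exact string matching for the pairwise F1, a Jaccard-style overlap for the token-level agreement, and a small hand-picked stopword set standing in for the spaCy Italian stopword list. The names STOPWORDS, tokens, token_agreement, and pairwise_f1 are hypothetical.

```python
import re

# Hypothetical, hand-picked subset of Italian stopwords. The paper takes its
# list from the spaCy Italian model, which is deliberately not loaded here.
STOPWORDS = {"a", "al", "del", "della", "di", "e", "il", "la", "nel", "per", "un", "una"}


def tokens(span: str) -> set[str]:
    """Lowercase a span, drop punctuation (e.g. '-' and '•') and stopwords."""
    return {t for t in re.findall(r"\w+", span.lower()) if t not in STOPWORDS}


def token_agreement(span_a: str, span_b: str) -> float:
    """Percentage of shared tokens between two spans (assumed: Jaccard overlap)."""
    a, b = tokens(span_a), tokens(span_b)
    if not a and not b:
        return 100.0  # two empty annotations are treated as full agreement
    return 100.0 * len(a & b) / len(a | b)


def pairwise_f1(entities_a: set[str], entities_b: set[str]) -> float:
    """F1 between two annotators' entity sets, using exact string matching.

    Annotator A is treated as the reference and annotator B as the prediction;
    F1 is symmetric, so swapping the roles gives the same value.
    """
    matched = len(entities_a & entities_b)
    if matched == 0:
        return 0.0
    precision = matched / len(entities_b)
    recall = matched / len(entities_a)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, pairwise_f1({"gotta"}, {"gotta", "artrite"}) gives about 0.67, and token_agreement("1 compressa al giorno", "- una compressa al giorno") gives about 66.7, since the list marker and the stopwords are discarded before the spans are compared.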