=Paper=
{{Paper
|id=Vol-3878/126_calamita_long
|storemode=property
|title=DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/126_calamita_long.pdf
|volume=Vol-3878
|authors=Raffaele Manna,Maria Pia Di Buono,Luca Giordano
|dblpUrl=https://dblp.org/rec/conf/clic-it/MannaBG24
}}
==DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge==
Raffaele Manna (1,†), Maria Pia di Buono (1,*,†) and Luca Giordano (1,†)
(1) University of Naples "L’Orientale", via Duomo 219, 80139 Napoli, Italy
Abstract
Patients’ knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. DIMMI (Drug InforMation Mining in Italian) is a challenge aiming at evaluating the proficiency of Large Language Models (LLMs) in extracting drug-specific information from PILs. The challenge seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, and to enhance drug information extraction and pharmacovigilance efforts. Participants are provided with a dataset of 600 Italian PILs and the objective is to develop models capable of accurately answering specific questions related to drug dosage, usage, side effects, and drug-drug interactions. The challenge should be approached as an information extraction task either in a zero-shot mode, purely based on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category, the GS contains the correct information extracted from the leaflets through manual annotation.
Keywords
Patient information leaflets, Information extraction, Large Language Models, Italian
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
(*) Corresponding author.
(†) These authors contributed equally.
Email: raffaele.manna@unior.it (R. Manna); mpdibuono@unior.it (M. P. d. Buono); giordanoluca.uni@gmail.com (L. Giordano)
ORCID: 0009-0006-6285-8557 (R. Manna); 0009-0009-9372-3323 (M. P. d. Buono); 0009-0002-3048-4408 (L. Giordano)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction and Motivation

Patients’ knowledge about drugs and medications is crucial as it allows them to administer them safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or from reading drug Web pages. Nevertheless, this information has been described as often inconsistent, incomplete, and difficult for patients to read and understand [1]. Despite the fact that in 2009 the European Commission issued guidelines (footnote 1) to recommend the publication of patient information leaflets with accessible and understandable information for patients, several scholars [2, 3, 4] account for the absence of improvement in the readability of such documents. Thus, educating patients about their medications seems to be a challenging task due to the linguistic nature of drug written information, which includes a high presence of specialized terms used to describe adverse drug reactions, diseases and other medical concepts that are not easy to understand.

Footnote 1: GUIDELINE ON THE READABILITY OF THE LABELLING AND PACKAGE LEAFLET OF MEDICINAL PRODUCTS FOR HUMAN USE - European Commission (2009). https://health.ec.europa.eu/document/download/d8612682-ad17-40e3-8130-23395ec80380_en

Recently, there has been a growing interest in the utilization of Large Language Models (LLMs) within the medical field to improve various aspects of healthcare, including medical education and clinical decision-making support [5]. Several specialized medical LLMs have been developed through novel pre-training methodologies or enhancements of existing models. Moreover, several evaluation campaigns have been undertaken to evaluate the efficacy of natural language processing models in facilitating knowledge retrieval for clinicians and patients alike. Examples of such campaigns are the 1) Medical Question Answering Task at TREC-2017 LiveQA [6] and subsequent studies [7], which led to two datasets, LiveQA and MedicationQA; 2) the tasks on Medical Consumer Question Answering proposed by Nguyen et al. [8] based on their dataset MedRedQA. Both campaigns have contributed significantly to bridging the gap between consumers’ medication questions and trusted answers, and, more generally, to the development of resources tailored to healthcare information retrieval. For a thorough survey of evaluation campaigns on clinical natural language processing refer to Filannino and Uzuner [9].

The application of LLMs as patient assistants to support drug knowledge and ease their administration seems very attractive, however it needs to be evaluated carefully due to the presence of model hallucinations, potentially causing medical malpractice [10], as any concealed inaccuracies in diagnoses and health advice could lead to severe outcomes [11]. For these reasons, in the evolving landscape of Artificial Intelligence (AI) applications in
medicine, considerations have been raised regarding the regulatory approval of LLMs as medical devices, highlighting the ethical and legal dimensions associated with deploying such technologies in healthcare settings [12].

To delve deeper into this topic, within the CALAMITA campaign [13], we introduce DIMMI (Drug InforMation Mining in Italian), a challenge centered on evaluating the proficiency of LLMs in extracting drug-specific information from Italian PILs.

In this way, the task aims to contribute to the development of AI systems for enhancing drug information extraction and pharmacovigilance efforts, specifically for the Italian language.

2. DIMMI

As DIMMI seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, participants are provided with the complete leaflets for each drug and the objective is to develop models capable of accurately answering specific questions related to a drug, such as its dosage, usage, etc.

The challenge should be approached as an information extraction task either in a zero-shot mode, purely based on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category (e.g., dosage, usage, side effects, drug-drug interactions), the GS contains the correct information extracted from the leaflets, manually annotated according to the categories described in Section 4.1.

3. Data description

3.1. Origin of data

The challenge dataset is derived from the D-LeafIT Corpus [14], available on GitHub (https://github.com/unior-nlp-research-group/D-LeafIT), made up of 1819 Italian drug package leaflets. The corpus has been created by extracting PILs available on the website of the Italian Agency for Medications (Agenzia Italiana del Farmaco - AIFA, https://www.aifa.gov.it/en/home), among which 1439 refer to generic drugs and 380 to class A drugs.

In the original corpus, the generic drug leaflets amount to 6,154,007 tokens while the class A leaflets amount to 1,650,879 tokens, for a total of 7,804,886 tokens. The DIMMI dataset represents a subset of 600 entries randomly selected from the D-LeafIT corpus.

It is worth stressing that the information is extracted from pdf files and converted into text, which means that some errors and typos may occur. Furthermore, the original D-LeafIT presents some data noise, e.g., the presence of paratext and wrong encoding from pdf files. To fix these issues, we perform a cleaning procedure as a pre-processing phase, to obtain the final dataset. The procedure is mainly automatic and based on recurrent patterns, so some of the aforementioned issues could still be present. The dataset pre-processing phase can be summarized in two main steps:

• Correcting the separation of each leaflet by identifying regular patterns which indicate the beginning/end of a unique leaflet.
• Removing additional information about the issue date, the pharmaceutical company, and the marketing authorization.

Additionally, we notice the presence of several cases of duplicate entries, due to different reasons, as described below:

1. Same drug name, same dosage form, same ingredient amount, different issue dates → These cases indicate that the leaflet has been updated and all the versions are recorded in the AIFA repository. In such cases, on the basis of their ID, the less recent leaflet has been removed.
2. Same drug name, same dosage form, different ingredient amount → These cases may or may not present the same information leaflets. We do not remove the duplicate entries, even if they present the same information about the classes we are interested in.
3. Same drug name, different dosage form, same ingredient amount → These duplicates are not removed, as dosage information can be differentiated on the basis of the drug form.
4. Same drug name, same dosage form, same ingredient amount, different pharmaceutical company → These duplicates are removed and just one entry is kept. We usually prefer keeping the one whose name includes ’DOC generici’. If this is not possible, we keep the first occurrence.

3.2. Data format

The whole leaflets are provided in the dataset, so that the context is available. Additionally, we provide the drug name for each leaflet. The final dataset, released (https://huggingface.co/datasets/RafaMann/DIMMI) in .tsv (tab-separated values) format, contains four columns.
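As an illustration of the data format described above, the following is a minimal sketch of how the released .tsv file might be read with Python's standard csv module. The column names are taken from Table 1; the in-memory sample row, the helper name read_dimmi_tsv, the raised field size limit, and the choice of QUOTE_NONE are assumptions made for this example and are not part of the official release.

```python
import csv
import io

# Leaflets are long single fields, so raise the csv field size limit (assumed value).
csv.field_size_limit(10_000_000)

# Header as in Table 1; the row is the abbreviated example entry from that table.
SAMPLE_TSV = (
    "ID\tID_Loc\tDrug_Name\tText\n"
    "119\t119_276\tBOTAM\tBOTAM 0,4 mg capsule (...) Tamsulosina cloridrato "
    "Medicinale equivalente (...).\n"
)


def read_dimmi_tsv(handle):
    """Return one dict per leaflet, keyed by the column names in the header."""
    # QUOTE_NONE leaves quotation marks inside the leaflet text untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = read_dimmi_tsv(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["ID"], row["ID_Loc"], row["Drug_Name"], len(row["Text"].split()))
```

For the actual file, the same function could be called on an open handle, for instance one obtained with open(path, encoding="utf-8", newline=""), where path points to a local copy of the dataset (the UTF-8 encoding is likewise an assumption).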
For each entry we present an ID, an ID_Loc which indicates the id location in the original corpus, the drug name (without any reference to the ingredient amount, the dosage form, and the pharmaceutical company), and the leaflet text (Table 1).

{| class="wikitable"
|+ Table 1: Example of a DIMMI entry
! ID !! ID_Loc !! Drug_Name !! Text
|-
| 119 || 119_276 || BOTAM || BOTAM 0,4 mg capsule (...) Tamsulosina cloridrato Medicinale equivalente (...).
|}

Participants in the DIMMI challenge are required to use LLMs to extract the following information from the PILs text: ’Molecule’, ’Usage’, ’Dosage’, ’Drug Interaction’, and ’Side Effect’. This information must be provided as output in a structured format such as TSV or JSON, with reference to each ID and drug name contained in the evaluation dataset. The information extracted for each ID and drug name with reference to ’Molecule’, ’Usage’, ’Dosage’, ’Drug Interaction’, and ’Side Effect’ must be represented in the form of a list of strings (see Section 4.2).

The evaluation dataset for the DIMMI corpus contains columns for the following entity types: ’Molecule’, ’Usage’, ’Dosage’, ’Drug Interaction’, and ’Side Effect’. For each instance (drug leaflet) in the DIMMI corpus, these entity-specific columns are populated with a list of strings, representing the annotated entities of the corresponding type.

The ’Molecule’ column will contain a list of the unique molecular entities mentioned in the text, while the ’Usage’ column will include a list of the specific uses or indications for the drug. The ’Dosage’ column will hold a list of the textual spans describing the dosage, administration, or regimen information. The ’Drug Interaction’ column will contain a list of the potential interactions with other drugs, and the ’Side Effect’ column will include a list of the adverse effects associated with the drug.

3.3. Prompting

For each drug in the dataset, we evaluate the results from two types of zero-shot prompts in Italian, i.e., specific task-focused prompts and structured prompts.

The former type is composed of five questions, one for each information type we want to extract, as reported below (footnote 5):

1. Qual è la molecola di {drug_name}? (What is the molecule of {drug_name}?) - to extract the molecule
2. Per cosa si usa {drug_name}? (What is {drug_name} used for?) - to extract the usage
3. Qual è la posologia raccomandata per {drug_name}? (What is the recommended dosage for {drug_name}?) - to extract the dosage
4. Quali sono gli effetti collaterali di {drug_name}? (What are the potential side effects of taking {drug_name}?) - to extract side effects
5. Con quali medicinali interagisce {drug_name}? (What are the drug interactions of {drug_name}?) - to extract the interaction with other drugs

Footnote 5: It is worth stressing that in the prompt examples {drug_name} is not a masked word; it is a placeholder for one of the entries from the Drug_Name column in the DIMMI dataset.

The latter type of prompt aims at extracting all the relevant information with a single instruction that helps the model understand the expected output structure and facilitates extraction, as follows:

• Fornisci le seguenti informazioni su {drug_name}:
Molecola:
Uso:
Posologia:
Effetti collaterali:
Interazioni con altri medicinali:
(Provide the following information about {drug_name}:
Molecule:
Usage:
Dosage:
Side Effects:
Drug interaction:)

3.4. Dataset statistics

As mentioned before, the final dataset is composed of 600 unique PILs in Italian, providing a comprehensive dataset for the challenge. The documents in the DIMMI dataset exhibit a wide range of lengths (Table 2), with the shortest document containing 363 tokens and the longest extending to 11,730 tokens. This range in token count directly corresponds to the word count, indicating that each word is treated as a single token in this analysis. On average, each document contains approximately 2,520 words, with a standard deviation of 848 words, indicating moderate variability in document length. The distribution of document lengths is further characterized by the 25th, 50th (median), and 75th percentiles, which are 1,960, 2,448, and 2,980.75 words, respectively.

In total, the corpus contains 1,511,724 words and tokens.
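The figures in this section can be approximated with a few lines of Python. The sketch below assumes whitespace tokenization (consistent with the statement that each word counts as one token) and the sample standard deviation; the function name length_stats and the toy documents are illustrative only, and the exact tokenizer and percentile method behind the reported numbers are not specified in the text.

```python
import statistics


def length_stats(texts):
    """Whitespace-token statistics over a list of document strings."""
    docs = [text.split() for text in texts]
    lengths = [len(doc) for doc in docs]
    all_tokens = [token for doc in docs for token in doc]
    unique_tokens = set(all_tokens)
    # Quartiles with linear interpolation between observed values (assumed method).
    q25, q50, q75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "num_documents": len(docs),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.mean(lengths),
        "std_length": statistics.stdev(lengths),  # sample standard deviation
        "percentiles": (q25, q50, q75),
        "total_tokens": len(all_tokens),
        "unique_tokens": len(unique_tokens),
        "type_token_ratio": len(unique_tokens) / len(all_tokens),
    }


# Toy documents standing in for leaflet texts.
toy_stats = length_stats(["a b c", "a b", "a b c d e"])
print(toy_stats)

# Consistency check on the totals reported in this section.
reported_mean = 1511724 / 600
reported_ttr = 58901 / 1511724
print(round(reported_mean, 2), round(reported_ttr, 3))
```

Applied to the reported totals, the last two lines recompute the mean document length (1,511,724 words over 600 documents) and the type-token ratio (58,901 unique tokens over 1,511,724 tokens).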
The lexical diversity of the corpus is reflected in the 58,901 unique tokens identified, resulting in a type-token ratio (TTR) of 0.0390. This relatively low TTR suggests a high degree of repetition within the text, which is typical for technical and regulatory documents such as drug package leaflets. Importantly, there are no empty documents in the corpus, ensuring that all entries contribute meaningful content to the dataset.

{| class="wikitable"
|+ Table 2: DIMMI statistics
! Statistic !! Value
|-
| num_documents || 600
|-
| mean_length || 2519.54
|-
| min_length || 363
|-
| max_length || 11730
|-
| std_length || 848.41
|-
| percentiles || .25: 1960, .5: 2448, .75: 2980.75
|-
| total_words || 1511724
|-
| mean_words_per_doc || 2519.54
|-
| total_tokens || 1511724
|-
| min_tokens || 363
|-
| max_tokens || 11730
|-
| unique_tokens || 58901
|-
| type_token_ratio || .039
|}

4. Evaluation metrics

We will evaluate the results using accuracy, precision, recall and F1 score, using a gold standard as benchmark (see Section 4.1).

The details for each metric are provided below:

• Precision: the share of extracted items that are correct. For example: Dosage: if the model extracts "200mg-400mg every 4-6 hours" and this is correct, the precision for dosage is 100%. Side Effects: if the model extracts "Stomach upset, nausea" and only one of the two items is correct, the precision for side effects is 50% (side effects that the model fails to extract lower recall, not precision).
• Recall: the share of gold-standard items that the model extracts. For example: Dosage: if the correct dosage is "200mg-400mg every 4-6 hours" and the model extracts only "200mg-400mg", the recall for dosage is 50%. Side Effects: if the correct side effects are "Stomach upset, nausea, dizziness, headache" and the model extracts "Stomach upset, nausea", the recall for side effects is 50%.
• F1 score: a balanced measure of precision and recall. A higher F1 score indicates better performance.
• Accuracy: the overall percentage of correct extractions across all classes. As far as this metric is concerned, we also evaluate the class-level accuracy, i.e., the accuracy for each specific class separately.

4.1. Gold Standard Creation

In order to evaluate the system results, we created a gold standard (GS), manually annotating the following categories: i) molecule; ii) dosage; iii) drug interaction; iv) usage; v) side effect. For each of the aforementioned classes we define some guidelines and specifications for the annotation, as summarised in the following paragraphs.

Molecule. The category is used to identify the main ingredient(s) of the drug. In some cases, the bulking agent(s) may be reported together with the molecule(s). These are not included in the molecule class.

Dosage information. This class refers to the recommended dosage for drug administration. We annotate neither the treatment duration nor the maximum dosage in the dosage information.

For dosage information we distinguish between dosage for children and adults. We do not distinguish dosage for infants or elders (the former is annotated as dosage information for children, the latter as dosage information for adults, as reported below).

When the same dosage can be used for both adults and children, the general dosage information category is applied.

Example:
10 mg una volta al giorno negli adulti e nei bambini di età uguale o superiore ai 10 anni
(10 mg once a day in adults and children aged 10 years or older)

Furthermore, dosage information could be differentiated on the basis of age/weight. In such cases, unless dosages for adults and children are explicitly differentiated, we always use the general category dosage.

Example:
Adulti, anziani e bambini di età pari o superiore a 12 anni con un peso corporeo pari o superiore a 50 chilogrammi (kg): • da 1 a 2 g una volta al giorno a seconda della gravità e del tipo di infezione
(Adults, elderly, and children aged 12 years and older with a body weight of 50 kilograms (kg) or more: • 1 to 2 g once a day, depending on the severity and type of infection.)

Dosage for infants can be expressed through a co-reference to some other dosage, e.g., for adults or children, sometimes with a different time schedule, as in lo stesso dosaggio sopra descritto ma somministrato una volta ogni due giorni (The same dosage as described above, but administered once every two days.). Unless the dosage is explicitly mentioned, we do not annotate these spans, as the information is context-dependent.

The treatment of specific diseases might require different dosages for the same drug. When they are reported in the leaflet, following a minimum span principle, we annotate all the dosages without any specification about the disease. Due to the aforementioned annotation choice, the annotation result will be a set of dosage information, as in the following example (annotated spans are in bold face in the original).

Example:
Aspergillosi: - 2 capsule una volta al giorno per un periodo di 2-5 mesi; (...) Candidosi: 1-2 capsule 1 volta al giorno per un periodo da 3 settimane a 7 mesi (...) Criptococcosi non meningea: 2 capsule una volta al giorno per un periodo dai 2 mesi ad 1 anno (...)
(Aspergillosis: - 2 capsules once a day for a period of 2-5 months; (...) Candidiasis: 1-2 capsules once a day for a period from 3 weeks to 7 months (...) Non-meningeal cryptococcosis: 2 capsules once a day for a period from 2 months to 1 year (...))

When the same dosage can be applied in more than one case, span duplicates may be present (e.g., 2 capsule una volta al giorno). In the final GS, these are removed so that only one span for each type is kept.

Some drugs must be administered according to a schedule that spans different time periods, with or without dosage variations. In such cases, we annotate only the initial recommended dosage.

In some cases, the posology section does not provide specific dosage information and instead includes a general recommendation to consult a doctor. In these instances, we consider the information to be missing and do not annotate the general statement.

Drug interaction. As for drug interactions, we annotate the name of molecules and drugs when they are available. In some cases, the information about drug interaction is reported as a general reference to the use of some drugs (e.g., medicinali per abbassare la pressione - medicines to lower blood pressure). In such instances, as we cannot identify the specific molecule or drug, we annotate the general reference. Information about drug interactions may also appear as a reference to certain types of relationships with other molecules, as in derivati della fenotiazina (phenothiazine derivatives). For our annotations, we omit additional information and select the minimal span, in the aforementioned example, fenotiazina (phenothiazine).

Similarly, when the information pertains to the drug class instead of reporting the molecule, e.g., lassativi (laxatives), we annotate the minimal span, even though in some cases the drug use is specified, e.g., medicinali usati per trattare la stipsi (medicines used to treat constipation).

We apply a hierarchical priority to identify and annotate the minimal span that conveys the information about drug interactions, as follows:

1. Molecule
2. Drug class
3. Drug use

The aforementioned hierarchy helps us identify the span to be annotated. When included, drug names are always annotated.

When the interaction information is reported with the specific pharmaceutical form (e.g., eritromicina iniettabile - injectable erythromycin), only the minimal possible span is annotated, i.e., eritromicina.

In some cases, examples of interacting molecules or drug names are provided alongside the drug class or use (e.g., medicinali usati per il trattamento dell’HIV AIDS, per esempio ketoconazolo e itraconazolo - medicines used for the treatment of HIV/AIDS, for example, ketoconazole and itraconazole). In these instances, we annotate both, as the list of drugs and molecules may not be exhaustive. If the list is exhaustive, we do not annotate the general reference to the drug use; we only annotate the drug molecules or names.

Interactions with some other molecules can be conditioned by the amount taken, e.g., cimetidina, preso in dosi giornaliere superiori a 800mg (cimetidine, taken in daily doses greater than 800 mg). In these cases too, the molecule name is the only span annotated.

Some interacting drugs are reported as the general drug class, together with a plain language explanation and a subclass specification, as in the following example:

diuretici (compresse per urinare in particolare quelli chiamati risparmiatori di potassio)
(diuretics (tablets for urination, particularly those called potassium-sparing))

As the molecule is not noted, we annotate both the general class and the subclass (both in bold face in the original excerpt).

Additionally, food and beverages can also interact with drugs, e.g., pompelmo, alcol (grapefruit, alcohol). We opt not to include these substances within the drug interaction class, as we want to focus only on the pharmaceutical drug interaction.

Drug interaction information is considered missing when there is only a general statement that the use of any further drug should be reported.

Usage. With respect to usage, we consider the minimal possible span, which indicates the disease treated by the specific drug. Thus, for instance, in the sentence {drug_name} è usato nel trattamento della gotta ({drug_name} is used in the treatment of gout), we annotate only gotta (gout).

In other cases, some examples of usage may be reported as in traumi (ad esempio causati dallo sport) (injuries (for example, those caused by sports)). As those cases are not
representative enough of usages, we do not include them in the annotation, so in the previous excerpt we annotate just traumi (injuries).

Within the usage section, a plain-language description is sometimes reported together with the reference to the specific disease, e.g., meningite criptococcica - un’infezione micotica del cervello (...) (cryptococcal meningitis - a fungal infection of the brain (...)). We always annotate the specific term for the disease and discard the plain-language description.

When the generic disease class is presented, e.g., infezioni cutanee (skin infections), followed by a non-exhaustive list of examples, we annotate just the generic use.

Side effects. This class indicates all the possible side effects caused by the drug consumption. In PILs, this type of information is generally grouped on the basis of the number of people affected by the side effects to identify different diffusion levels, e.g., very common side effects, very rare side effects. We do not differentiate among the diffusion levels and consider all the side effects as belonging to the same class side_effect. In some cases, side effects affecting subjects other than the person consuming the drug are reported. For instance, some drugs can affect the fetus, as in the following excerpt.

Example:
(...) Se assume Ricap durante le ultime fasi della gravidanza, il suo bambino potrebbe manifestare i seguenti sintomi: problemi a respirare, colorito bluastro o violaceo della pelle, convulsioni (...).
[(...) If you take Ricap during the later stages of pregnancy, your baby may experience the following symptoms: breathing problems, bluish or purplish skin discoloration, seizures (...)]

We do not annotate these secondary side effects, nor the ones derived from drug overdose.

When the side effect type is reported together with its symptoms, we include those within the class of side effects. For instance, in some cases a list of symptoms such as difficoltà respiratoria, riduzione della pressione sanguigna (breathing difficulty, reduced blood pressure) is combined with the general side effect reazioni allergiche (allergic reactions). Each of them is annotated separately and included in the list of side effects.

Similarly, we annotate both the plain language side effect and the term, as in problemi del flusso della bile (colestasi) (bile flow problems (cholestasis)).

When the side effects are reported as worsening of an already existing disease, e.g., aumentata perdita di capelli (increased hair loss), we annotate the minimum possible span, i.e., perdita di capelli (hair loss).

For drugs containing more than one molecule, side effects are reported along with the side effects for each individual molecule. We annotate all of them.

Side effects can be reported with reference to some patient/disease type, e.g., Se è HIV positivo può mostrare effetti indesiderati (If you are HIV positive, you may experience side effects). In such cases, symptoms are annotated without any further specification.

If duplicates are present, they are either not annotated or removed in the post-processing phase, so that just one entry per symptom type is recorded in the GS.

Sometimes, side effects are grouped by indicating the general area (e.g., organ or functionality) affected, e.g., nervous system disorders. The information might be followed by a list of specific side effects. When this is the case, we discard the general information in favor of the most specific one.

It is worth stressing that other information may be presented in PILs, for instance Precautions for use. As we are not interested in this type of information, we do not annotate such sections.

Inter-Annotator Agreement. The annotation has been performed by three people with computational linguistic backgrounds and different levels of expertise. An initial inter-annotator agreement was evaluated after the first draft of the guidelines had been created. Borderline cases and issues were collected by each of the annotators and subsequently discussed and solved. The guidelines were updated accordingly and a second round of annotation was performed in order to compute the final inter-annotator agreement.

The annotation round for evaluating the final inter-annotator agreement was performed on a subset of 60 leaflets.

The results, calculated before the post-processing phase, show complete agreement on the molecule class among all the annotators, while for the remaining classes the pairwise agreement ranges from .61 (posology) to .80 (side effects) (Table 3).

{| class="wikitable"
|+ Table 3: IAA for the GS
! Class !! A1/A2 !! A1/A3 !! A2/A3 !! AVG
|-
| Molecule || 1 || 1 || 1 || 1
|-
| Usage || .69 || .67 || .68 || .68
|-
| Posology || .61 || .62 || .66 || .63
|-
| Drug interaction || .66 || .66 || .65 || .66
|-
| Side effects || .80 || .76 || .75 || .78
|}

To assess the inter-annotator agreement (IAA) for the creation of the gold standard, we employed two different metrics: pairwise F1 score [15, 16] and token-level agreement percentage [17]. The pairwise F1 score was used to calculate the IAA for the "Molecule" and "Usage" labels, as the information contained in the text for these entities refers to unique and well-defined concepts. This metric provides a balanced measure of the precision and recall of the annotations, allowing us to quantify the level of
agreement between annotators on the identification of di calcio nel sangue and ipercalcemia.
these specific entities. This choice aims at accounting for both entities as possi-
On the other hand, for the "Dosage", "Drug Interaction", ble correct answers.
and "Side Effect" classes, we opted to use the token-level For instance, for the drug NATRILIX, the expected re-
agreement percentage as the IAA metric. This choice sults are as it follows:
was motivated by the fact that these classes involve vari-
able text spans, which can be more challenging to align • Usage: pressione sanguigna elevata, ipertensione
between annotators. Before calculating the token-level arteriosa essenziale
agreement percentage, we performed preprocessing steps • Molecule: indapamide
on the annotated portions, removing punctuation marks • Dosage: 1 compressa al giorno
(such as - and • that indicate a list) and Italian stopwords • Side_effect: eruzioni cutanee, bassi livelli di potas-
from the Spacy Italian language model6 . The token-level sio nel sangue, vomito, porpora ...
agreement percentage provides a more granular assessment of the consistency in the identification of the relevant text segments, which is crucial for the accurate extraction of these types of entities from the source documents.

GS Post-processing. To ensure high consistency among annotations and to remove additional information that does not meet the specified annotation criteria, we perform a post-processing step. During this phase, we review the GS, using recurring patterns and regular expressions to clean the data and correct errors. We also carry out manual cleaning to produce the final GS. For instance, when applicable, we remove the drug name mentioned in the posology specification (e.g., one tablet of drug_name once a day) so that only the general information related to the molecule is retained.
The resulting evaluation dataset contains XXX annotated molecules, XXX drug interactions, XXX usage information, and XXX side-effects (Table 4).

Class             Tot. Entities   Unique Entities
Molecule          657             657
Usage             2159            2113
Posology          831             827
Drug interaction  8617            8458
Side effect       36748           30313
Total             49012           42368

Table 4: Annotated entities for each class

4.2. Results

The expected results should be presented as a list of entities for each of the classes of information about each drug. To obtain the result lists, we consider the annotated terms and their simplifications as distinct entities, e.g., the span livelli aumentati di calcio nel sangue (ipercalcemia) (elevated levels of calcium in the blood (hypercalcemia)) is listed as two separate entities: livelli aumentati di calcio nel sangue and ipercalcemia.

• Drug_interaction: litio, chinidina, idrochinidina, disopiramide (...)

For the drug Trevid, the correct answers would be:

• Usage: carenza di vitamina D
• Molecule: colecalciferolo
• Dosage: 3-4 gocce al giorno
• Side_effect: livelli aumentati di calcio nel sangue, ipercalcemia, livelli aumentati di calcio nelle urine, ipercalciuria, debolezza, astenia, reazioni allergiche, appetito ridotto (...)
• Drug_interaction: anticonvulsivanti, barbiturici, colestipolo, colestiramina, orlistat (...)

Since this is an information extraction task in a zero-shot setting based on PILs, it is expected that LLMs will be able to extract the exact terminology used in the different sections of the PILs and provide a list of terms. The performance will be evaluated based on the metrics described in Section 4. Potential limitations in accurately assessing the performance of LLMs may arise from: 1) the variability in the models’ choice of terms to extract, and 2) the provision of terms and their simplifications as two entities. In the first case, forcing the LLMs to provide a more structured and less ambiguous output might help, as the gold standard currently does not account for a set of synonyms to handle variability in the output; in the second case, additional metrics could be employed.

5. Limitations

One important limitation of the DIMMI dataset is the disclaimer provided by the Italian Medicines Agency (AIFA) regarding the content available on their website⁷ in section A. Disclaimer. AIFA states that all the information and services offered on their website are provided "as is" and "with all faults". The Italian Medicines Agency, therefore, does not provide any kind of warranty, either explicit or implied, regarding the content, including, without limitation, the legality, ownership, suitability, or fitness for particular purposes or uses.

⁶ https://spacy.io/models/it#it_core_news_lg
⁷ https://www.aifa.gov.it/en/copyright
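As a rough illustration of two of the steps described in this section, the following Python sketch removes a drug name from a posology string (GS post-processing) and splits a term with a parenthesised simplification into two entities (result-list construction). It is not the authors' implementation: the paper does not give its actual patterns, so the function names `strip_drug_name` and `split_entities`, both regular expressions, and the input string "3-4 gocce di Trevid al giorno" are assumptions made only for this example.

```python
import re
from typing import List


def strip_drug_name(posology: str, drug_name: str) -> str:
    """Drop a brand name (and a preceding Italian 'di') from a posology string."""
    # Assumed pattern, not taken from the paper: optional "di" + the drug name.
    pattern = re.compile(
        r"\s*(?:\bdi\s+)?\b" + re.escape(drug_name) + r"\b", re.IGNORECASE
    )
    cleaned = pattern.sub("", posology)
    # Collapse any whitespace left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()


# Assumption: a trailing parenthesised group is the simplification of the term.
_SIMPLIFICATION = re.compile(r"^(?P<term>.+?)\s*\((?P<simple>[^()]+)\)$")


def split_entities(span: str) -> List[str]:
    """Return the term and its simplification as separate entities."""
    span = span.strip()
    match = _SIMPLIFICATION.match(span)
    if match is None:
        # No simplification present: the span is a single entity.
        return [span]
    return [match.group("term").strip(), match.group("simple").strip()]


if __name__ == "__main__":
    print(strip_drug_name("3-4 gocce di Trevid al giorno", "Trevid"))
    print(split_entities("livelli aumentati di calcio nel sangue (ipercalcemia)"))
```

A real cleaning pass would likely need more patterns than these two, and, as the section notes, manual review remains necessary to produce the final GS.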
This disclaimer from the data source raises concerns about the reliability and quality of the patient information leaflets (PILs) that were used to construct the DIMMI corpus. While the dataset has been carefully curated and annotated, the underlying data may contain errors, inaccuracies, or other issues that are not explicitly acknowledged by the original provider. Researchers and developers using the DIMMI dataset should be aware of this limitation and exercise caution when relying on the information contained within the corpus, particularly for critical applications or decision-making processes.

6. Ethical issues

Ethical considerations are crucial when working with a dataset that contains sensitive information from PILs. The DIMMI corpus, which is derived from the AIFA (Italian Medicines Agency) Database, must be handled with the utmost care and respect for individual privacy, data protection, and the diversity of the target population.
Additionally, the use of the DIMMI corpus for the development and evaluation of natural language processing models must be guided by ethical principles that consider the diversity of the target population. The models trained on this data should be designed and deployed in a way that respects individual privacy, avoids potential misuse or discrimination, and ultimately benefits the public good, regardless of ethnicity or age. Careful consideration should be given to the potential societal impact of the applications built upon the DIMMI dataset, ensuring that they are inclusive and equitable.
By upholding the ethical standards in the handling and utilization of the DIMMI corpus, the research community can ensure that the valuable pharmacological information contained in the PILs is leveraged responsibly and in a manner that prioritizes the well-being of patients and the general public, while respecting the diversity of the target population.

7. Data license and copyright issues

The DIMMI corpus has been created using the patient information leaflets (PILs) from the AIFA (Italian Medicines Agency) Database. As reported in the Web site⁸, the distribution license used by AIFA for these data is the Creative Commons Attribution (CC-BY) license, version 4.0. This license allows third parties to distribute, modify, adapt, and use the data, even for commercial purposes, with the sole requirement of providing attribution to the original source.
By making the DIMMI corpus available under the CC-BY 4.0 license, the dataset can be freely accessed, utilized, and built upon by the scientific community, contributing to the advancement of research and applications in the field of biomedical text mining and pharmacological information extraction.

⁸ https://www.aifa.gov.it/en/copyright

Acknowledgments

Luca Giordano has been supported by Borsa di Studio GARR "Orio Carlini" 2023/24 - Consortium GARR, the National Research and Education Network.

References

[1] W. H. Shrank, J. Avorn, Educating patients about their medications: the potential and limitations of written drug information, Health affairs 26 (2007) 731–740.
[2] P. Rodríguez, R. Azarola, S. Lorda, B. Cantalejo, A. Danet, et al., Quality improvement of health information included in drug information leaflets. patient and health professional expectations, Atencion primaria 42 (2009) 22–27.
[3] M. Á. Piñero-López, P. Modamio, C. F. Lastra, E. L. Mariño, Readability analysis of the package leaflets for biological medicines available on the internet between 2007 and 2013: an analytical longitudinal study, Journal of medical Internet research 18 (2016) e100.
[4] I. Segura-Bedmar, P. Martínez, Simplifying drug package leaflets written in spanish by using word embedding, Journal of biomedical semantics 8 (2017) 1–9.
[5] M. Yuan, P. Bao, J. Yuan, Y. Shen, Z. Chen, Y. Xie, J. Zhao, Y. Chen, L. Zhang, L. Shen, et al., Large language models illuminate a progressive pathway to artificial healthcare assistant: A review, arXiv preprint arXiv:2311.01918 (2023).
[6] A. B. Abacha, E. Agichtein, Y. Pinter, D. Demner-Fushman, Overview of the medical question answering task at trec 2017 liveqa., in: TREC, 2017, pp. 1–12.
[7] A. B. Abacha, Y. Mrabet, M. Sharp, T. R. Goodwin, S. E. Shooshan, D. Demner-Fushman, Bridging the gap between consumers’ medication questions and trusted answers, in: MEDINFO 2019: Health and Wellbeing e-Networks for All, IOS Press, 2019, pp. 25–29.
[8] V. Nguyen, S. Karimi, M. Rybinski, Z. Xing, Medredqa for medical consumer question answering: Dataset, tasks, and neural baselines, in: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 629–648.
[9] M. Filannino, Ö. Uzuner, Advancing the state of the art in clinical natural language processing through shared tasks, Yearbook of medical informatics 27 (2018) 184–192.
[10] R. Vaishya, A. Misra, A. Vaish, Chatgpt: Is this version good for healthcare and research?, Diabetes & Metabolic Syndrome: Clinical Research & Reviews 17 (2023) 102744.
[11] P. Lee, S. Bubeck, J. Petro, Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine, New England Journal of Medicine 388 (2023) 1233–1239.
[12] S. Gilbert, H. Harvey, T. Melvin, E. Vollebregt, P. Wicks, Large language model ai chatbots require approval as medical devices, Nature Medicine 29 (2023) 2396–2398.
[13] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[14] L. Giordano, M. P. Di Buono, Large language models as drug information providers for patients, in: Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health)@ LREC-COLING 2024, 2024, pp. 54–63.
[15] G. Hripcsak, A. S. Rothschild, Agreement, the f-measure, and reliability in information retrieval, Journal of the American medical informatics association 12 (2005) 296–298.
[16] L. Deleger, Q. Li, T. Lingren, M. Kaiser, K. Molnar, L. Stoutenborough, M. Kouril, K. Marsolo, I. Solti, et al., Building gold standard corpora for medical natural language processing tasks, in: AMIA Annual Symposium Proceedings, volume 2012, American Medical Informatics Association, 2012, p. 144.
[17] C. Grouin, S. Rosset, P. Zweigenbaum, K. Fort, O. Galibert, L. Quintard, Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview, in: Proceedings of the 5th linguistic annotation workshop, 2011, pp. 92–100.