<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raffaele Manna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Pia di Buono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Giordano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Naples "L'Orientale"</institution>
          ,
          <addr-line>via Duomo 219, 80139 Napoli</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Patients' knowledge about drugs and medications is crucial, as it allows them to take medications safely. This knowledge frequently comes from written prescriptions, patient information leaflets (PILs), or drug Web pages. DIMMI (Drug InforMation Mining in Italian) is a challenge aimed at evaluating the proficiency of Large Language Models (LLMs) in extracting drug-specific information from PILs. The challenge seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, and to enhance drug information extraction and pharmacovigilance efforts. Participants are provided with a dataset of 600 Italian PILs, and the objective is to develop models capable of accurately answering specific questions related to drug dosage, usage, side effects, and drug-drug interactions. The challenge should be approached as an information extraction task, either in a zero-shot mode, based purely on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category, the GS contains the correct information extracted from the leaflets through manual annotation.</p>
      </abstract>
      <kwd-group>
        <kwd>Patient information leaflets</kwd>
        <kwd>Information extraction</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics,
Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
Email: raffaele.manna@unior.it (R. Manna); mpdibuono@unior.it
(M. P. di Buono); giordanoluca.uni@gmail.com (L. Giordano)</p>
      <p>ORCID: 0009-0006-6285-8557 (R. Manna); 0009-0009-9372-3323
(M. P. di Buono); 0009-0002-3048-4408 (L. Giordano)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>1 GUIDELINE ON THE READABILITY OF THE LABELLING AND PACKAGE LEAFLET OF MEDICINAL PRODUCTS FOR HUMAN USE - European Commission (2009). https://health.ec.europa.eu/document/download/d8612682-ad17-40e3-8130-23395ec80380_en</p>
      <p>
        (...) due to the presence of model hallucinations, potentially causing medical malpractice [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], as any concealed inaccuracies in diagnoses and health advice could lead to severe outcomes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For these reasons, in the evolving landscape of Artificial Intelligence (AI) applications in medicine, considerations have been raised regarding the regulatory approval of LLMs as medical devices, highlighting the ethical and legal dimensions associated with deploying such technologies in healthcare settings [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        To delve deeper into this topic, within the CALAMITA campaign [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], we introduce DIMMI (Drug InforMation Mining in Italian), a challenge centered on evaluating the proficiency of LLMs in extracting drug-specific information from Italian PILs. In this way, the task aims to contribute to the development of AI systems for enhancing drug information extraction and pharmacovigilance efforts, specifically for the Italian language.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. DIMMI</title>
      <sec id="sec-2-1">
        <p>As DIMMI seeks to advance the understanding of LLM effectiveness in processing complex medical information in Italian, participants are provided with the complete leaflets for each drug, and the objective is to develop models capable of accurately answering specific questions related to a drug, such as its dosage, usage, etc. The challenge should be approached as an information extraction task, either in a zero-shot mode, based purely on the model's pre-existing knowledge and understanding, or through in-context learning (Retrieval-Augmented Generation (RAG) or few-shot mode). The answers generated by the models will be compared against the gold standard (GS), created to establish a reliable, accurate, and comprehensive set of answers against which participant submissions can be evaluated. For each drug and each information category (e.g., dosage, usage, side effects, drug-drug interactions), the GS contains the correct information extracted from the leaflets, manually annotated according to the categories described in Section 4.1.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data description</title>
      <sec id="sec-3-1">
        <title>3.1. Origin of data</title>
        <sec id="sec-3-1-1">
          <p>
            The challenge dataset is derived from the D-LeafIT Corpus [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], available on GitHub (https://github.com/unior-nlp-research-group/D-LeafIT), made up of 1819 Italian drug package leaflets. The corpus has been created by extracting PILs available on the Italian Agency for Medications (Agenzia Italiana del Farmaco - AIFA) website (https://www.aifa.gov.it/en/home), among which 1439 refer to generic drugs and 380 to class A drugs.
          </p>
          <p>In the original corpus, the generic drug leaflets amount to 6,154,007 tokens and the class A leaflets to 1,650,879 tokens, for a total of 7,804,886 tokens. The DIMMI dataset represents a subset of 600 entries randomly selected from the D-LeafIT corpus.</p>
          <p>It is worth stressing that the information is extracted from PDF files and converted into text, which means that some errors and typos may occur. Furthermore, the original D-LeafIT presents some data noise, e.g., the presence of paratext and wrong encoding from PDF files. To fix these issues, we perform a cleaning procedure as a pre-processing phase to obtain the final dataset. The procedure is mainly automatic and based on recurrent patterns, so some of the aforementioned issues could still be present. The dataset pre-processing phase can be summarized in two main steps:
• Correcting the separation between leaflets by identifying regular patterns which indicate the beginning/end of a single leaflet.
• Removing additional information about the issue date, the pharmaceutical company, and the marketing authorization.</p>
          <p>Additionally, we notice several cases of duplicate entries, which arise for different reasons, as described below:
1. Same drug name, same dosage form, same ingredient amount, different issue dates → These cases indicate that the leaflet has been updated and all the versions are recorded in the AIFA repository. In such cases, on the basis of their ID, the less recent leaflet has been removed.
2. Same drug name, same dosage form, different ingredient amount → These cases may or may not present the same leaflet information. We do not remove the duplicate entries, even when they present the same information about the classes we are interested in.
3. Same drug name, different dosage form, same ingredient amount → These duplicates are not removed, as dosage information can be differentiated on the basis of the drug form.
4. Same drug name, same dosage form, same ingredient amount, different pharmaceutical company → These duplicates are removed and just one entry is kept. We usually prefer keeping the one whose name contains 'DOC generici'. If this is not possible, we keep the first occurrence.</p>
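          <p>The duplicate-handling rules above can be illustrated with a minimal sketch. This is not the procedure actually used to build the dataset: the Leaflet record, its field names, and the assumption that a larger ID denotes a more recent leaflet version are all hypothetical and introduced only for illustration.</p>
          <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaflet:
    # Hypothetical record; field names are illustrative, not the dataset schema.
    leaflet_id: int   # assumption: a larger ID means a more recent version
    full_name: str    # name as listed, possibly including the company
    drug_name: str
    dosage_form: str
    amount: str
    company: str


def deduplicate(leaflets):
    """Apply rules 1 and 4; rules 2 and 3 keep all entries, so they need no action."""
    # Rule 1: same name, form, amount and company -> keep only the most recent version.
    latest = {}
    for leaf in leaflets:
        key = (leaf.drug_name, leaf.dosage_form, leaf.amount, leaf.company)
        if key not in latest or leaf.leaflet_id > latest[key].leaflet_id:
            latest[key] = leaf

    # Rule 4: same name, form and amount from different companies -> keep one entry,
    # preferring a 'DOC generici' leaflet, otherwise the first occurrence.
    kept = {}
    for leaf in latest.values():
        key = (leaf.drug_name, leaf.dosage_form, leaf.amount)
        current = kept.get(key)
        if current is None:
            kept[key] = leaf
        elif "DOC generici" in leaf.full_name and "DOC generici" not in current.full_name:
            kept[key] = leaf
    return list(kept.values())


sample = [
    Leaflet(1, "X Alfa", "X", "compresse", "10 mg", "Alfa"),
    Leaflet(2, "X Alfa", "X", "compresse", "10 mg", "Alfa"),          # newer version of 1
    Leaflet(3, "X DOC generici", "X", "compresse", "10 mg", "DOC generici"),
    Leaflet(4, "X Alfa", "X", "compresse", "20 mg", "Alfa"),          # rule 2: kept
    Leaflet(5, "X Alfa", "X", "capsule", "10 mg", "Alfa"),            # rule 3: kept
]
kept_ids = [leaf.leaflet_id for leaf in deduplicate(sample)]
```
]]></preformat>
          <p>In this toy input, entries 1 and 2 collapse to the more recent one (rule 1), which is then superseded by the 'DOC generici' entry (rule 4), while the entries differing in ingredient amount or dosage form are retained.</p>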
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data format</title>
        <sec id="sec-3-2-1">
          <p>The whole leaflets are provided in the dataset, so that the context is available. Additionally, we provide the drug name for each leaflet. The final dataset, released on Hugging Face (https://huggingface.co/datasets/RafaMann/DIMMI) in .tsv (tab-separated values) format, contains four columns. For each entry we present an ID, an ID_LOC which indicates the ID location in the original corpus, the drug name (without any reference to the ingredient amount, the dosage form, and the pharmaceutical company), and the leaflet text (Table 1).</p>
          <table-wrap id="tab-1">
            <label>Table 1</label>
            <table>
              <thead>
                <tr><th>Drug_Name</th><th>Text</th></tr>
              </thead>
              <tbody>
                <tr><td>BOTAM</td><td>BOTAM 0,4 mg capsule (...) Tamsulosina cloridrato Medicinale equivalente (...).</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec id="sec-3-2-2">
          <p>Participants in the DIMMI challenge are required to use LLMs to extract the following information from the PILs text: 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect'. This information must be provided as output in a structured format such as TSV or JSON, with reference to each ID and drug name contained in the evaluation dataset. The information extracted for each ID and drug name with reference to 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect' must be represented in the form of a list of strings (see Section 4.2).</p>
          <p>The evaluation dataset for the DIMMI corpus contains columns for the following entity types: 'Molecule', 'Usage', 'Dosage', 'Drug Interaction', and 'Side Effect'. For each instance (drug leaflet) in the DIMMI corpus, these entity-specific columns are populated with the list of strings representing the extracted entities of the corresponding type.</p>
          <p>The 'Molecule' column will contain a list of the unique molecular entities mentioned in the text, while the 'Usage' column will include a list of the specific uses or indications for the drug. The 'Dosage' column will hold a list of the textual spans describing the dosage, administration, or regimen information. The 'Drug Interaction' column will contain a list of the potential interactions with other drugs, and the 'Side Effect' column will include a list of the adverse effects associated with the drug.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompting</title>
        <sec id="sec-3-3-1">
          <p>For each drug in the dataset, we evaluate the results from two types of zero-shot prompts in Italian, i.e., specific task-focused prompts and structured prompts.</p>
          <p>The former type is composed of five questions, one for each information type we want to extract, as reported below:
1. Qual è la molecola di {drug_name}? (What is the molecule of {drug_name}?) - to extract the molecule
2. Per cosa si usa {drug_name}? (What is {drug_name} used for?) - to extract the usage
3. Qual è la posologia raccomandata per {drug_name}? (What is the recommended dosage for {drug_name}?) - to extract the dosage
4. Quali sono gli effetti collaterali di {drug_name}? (What are the potential side effects of taking {drug_name}?) - to extract side effects
5. Con quali medicinali interagisce {drug_name}? (What are the drug interactions of {drug_name}?) - to extract the interaction with other drugs</p>
          <p>It is worth stressing that in the prompt examples {drug_name} is not a masked word; it is a placeholder for one of the entries from the drug_name column in the DIMMI dataset.</p>
        </sec>
        <sec id="sec-3-3-2">
          <p>The latter type of prompt aims at extracting all the relevant information with a specific instruction that helps the model understand the expected output structure and facilitates extraction, as follows:
• Fornisci le seguenti informazioni su {drug_name}:
Molecola:
Uso:
Posologia:
Effetti collaterali:
Interazione con altri medicinali:
(Provide the following information about {drug_name}:
Molecule:
Usage:
Dosage:
Side Effects:
Drug interaction:)</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.4. Dataset statistics</title>
          <p>
As mentioned before, the final dataset is composed of 600 unique PILs in Italian. The documents in the DIMMI dataset exhibit a wide range of lengths (Table 2), with the shortest document containing 363 tokens and the longest extending to 11,730 tokens. Token counts coincide with word counts, since each word is treated as a single token in this analysis. On average, each document contains approximately 2,520 words, with a standard deviation of 848 words, indicating moderate variability in document length. The distribution of document lengths is further characterized by the 25th, 50th (median), and 75th percentiles, which are 1,960, 2,448, and 2,980.75 words, respectively.</p>
          <p>Table 2. Dataset statistics reported: num_documents, mean_length, min_length, max_length, std_length, percentiles, total_words, mean_words_per_doc, total_tokens, min_tokens, max_tokens, unique_tokens, type_token_ratio.</p>
          <p>In total, the corpus contains 1,511,724 words and tokens. The lexical diversity of the corpus is reflected in the 58,901 unique tokens identified, resulting in a type-token ratio (TTR) of 0.0390. This relatively low TTR suggests a high degree of repetition within the text, which is typical for technical and regulatory documents such as drug package leaflets. Importantly, there are no empty documents in the corpus, ensuring that all entries contribute meaningful content to the dataset.</p>
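          <p>As a minimal sketch of how statistics of this kind can be derived, the snippet below counts whitespace-separated words as tokens, in line with the one-word-one-token treatment described above. The function name, the choice not to lowercase tokens, and the use of the population standard deviation are our assumptions, not details confirmed for the reported figures. The last line only re-derives the reported TTR from the two totals given in the text.</p>
          <preformat><![CDATA[
```python
from statistics import mean, median, pstdev


def corpus_statistics(texts):
    """Word-level statistics where every whitespace-separated word is one token."""
    token_lists = [text.split() for text in texts]
    lengths = [len(tokens) for tokens in token_lists]
    total_tokens = sum(lengths)
    # Assumption: tokens are compared as-is (no lowercasing or punctuation stripping).
    unique_tokens = {token for tokens in token_lists for token in tokens}
    return {
        "num_documents": len(texts),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "std_length": pstdev(lengths),  # assumption: population standard deviation
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "total_tokens": total_tokens,
        "unique_tokens": len(unique_tokens),
        "type_token_ratio": len(unique_tokens) / total_tokens,
    }


# Toy input, not taken from the dataset.
stats = corpus_statistics(["prendere una compressa al giorno", "una compressa"])

# Consistency check on the figures reported in the text: 58,901 / 1,511,724.
reported_ttr = round(58901 / 1511724, 4)
```
]]></preformat>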
        </sec>
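        <p>The two zero-shot prompt types introduced in this section can be instantiated with simple string templates. The sketch below is illustrative only: the dictionary keys, constant names, and helper functions are hypothetical, and how the leaflet text is supplied to the model alongside the prompt is deliberately left out, since it depends on the chosen setting.</p>
        <preformat><![CDATA[
```python
# Task-focused questions, one per information type (Italian prompts from this section).
QUESTION_TEMPLATES = {
    "Molecule": "Qual è la molecola di {drug_name}?",
    "Usage": "Per cosa si usa {drug_name}?",
    "Dosage": "Qual è la posologia raccomandata per {drug_name}?",
    "Side Effect": "Quali sono gli effetti collaterali di {drug_name}?",
    "Drug Interaction": "Con quali medicinali interagisce {drug_name}?",
}

# Structured prompt asking for all fields at once.
STRUCTURED_TEMPLATE = (
    "Fornisci le seguenti informazioni su {drug_name}:\n"
    "Molecola:\n"
    "Uso:\n"
    "Posologia:\n"
    "Effetti collaterali:\n"
    "Interazione con altri medicinali:"
)


def build_task_prompts(drug_name):
    """Return the five task-focused questions with the placeholder filled in."""
    return {
        label: template.format(drug_name=drug_name)
        for label, template in QUESTION_TEMPLATES.items()
    }


def build_structured_prompt(drug_name):
    """Return the single structured prompt with the placeholder filled in."""
    return STRUCTURED_TEMPLATE.format(drug_name=drug_name)


task_prompts = build_task_prompts("BOTAM")
structured_prompt = build_structured_prompt("BOTAM")
```
]]></preformat>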
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation metrics</title>
      <p>We will evaluate the results using accuracy, precision, recall and F1-score, using a gold standard as benchmark (see Section 4.1). The details for each metric are provided below:
• Precision metric: For example: Dosage: If the model extracts "200mg-400mg every 4-6 hours" and this is correct, the precision for dosage is 100%; Side Effects: If the model extracts "Stomach upset, nausea" and this is partially correct (missing other side effects), the precision for side effects might be 50% (depending on how many side effects are correctly identified);
• Recall metric: For example: Dosage: If the correct dosage is "200mg-400mg every 4-6 hours" and the model extracts only "200mg-400mg," the recall for dosage is 50%. Side Effects: If the correct side effects are "Stomach upset, nausea, dizziness, headache" and the model extracts "Stomach upset, nausea," the recall for side effects is 50%.
• F1-score metric: A balanced measure of precision and recall. A higher F1-score indicates better performance.
• Accuracy: The overall percentage of correct extractions across all classes. As far as this metric is concerned, we also evaluate the class-level accuracy, i.e., the accuracy for each specific class separately.</p>
      <sec id="sec-4-1">
        <p>In order to evaluate the system results, we created a gold standard (GS), manually annotating the following categories: i) molecule; ii) dosage; iii) drug interaction; iv) usage; v) side effect. For each of the aforementioned classes we define some guidelines and specifications for the annotation, as summarised in the following paragraphs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Molecule</title>
        <p>The category is used to identify the main ingredient(s) of the drug. In some cases, the bulking agent(s) may be reported together with the molecule(s). These are not included in the molecule class.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Dosage information</title>
        <p>This class refers to the recommended dosage for drug administration. We annotate neither the treatment duration nor the maximum dosage in the dosage information.</p>
        <p>For dosage information we distinguish between dosage for children and adults. We do not distinguish dosage for infants or elders (the former is annotated as dosage information for children, the latter as dosage information for adults, as reported below).</p>
        <p>When the same dosage can be used for both adults and children, the general dosage information category is applied.</p>
        <p>Example:
10 mg una volta al giorno negli adulti e nei bambini di età uguale o superiore ai 10 anni
(10 mg once a day in adults and children aged 10 years or older)</p>
        <p>Furthermore, dosage information could be differentiated on the basis of age/weight. In such cases, unless dosages for adults and children are explicitly differentiated, we always use the general category dosage.</p>
        <p>Example:
Adulti, anziani e bambini di età pari o superiore a 12 anni con un peso corporeo pari o superiore a 50 chilogrammi (kg): • da 1 a 2 g una volta al giorno a seconda della gravità e del tipo di infezione
(Adults, elderly, and children aged 12 years and older with a body weight of 50 kilograms (kg) or more: • 1 to 2 g once a day, depending on the severity and type of infection.)</p>
        <p>Dosage for infants can be expressed through a co-reference to some other dosage, e.g., for adults or children, sometimes with a different time schedule, as in lo stesso dosaggio sopra descritto ma somministrato una volta ogni due giorni (The same dosage as described above, but administered once every two days.). Unless the dosage is explicitly mentioned, we do not annotate these spans, as the information is context-dependent.</p>
        <p>The treatment of specific diseases might require different dosages for the same drug. When they are reported in the leaflet, following a minimum span principle, we annotate all the dosages without any specification about the disease. Due to the aforementioned annotation choice, the annotation results will be a set of dosage information, as in the following example (annotated spans reported in bold face).</p>
        <p>Example:
Aspergillosi: - 2 capsule una volta al giorno per un periodo di 2-5 mesi; (...) Candidosi: 1-2 capsule 1 volta al giorno per un periodo da 3 settimane a 7 mesi (...) Criptococcosi non meningea: 2 capsule una volta al giorno per un periodo dai 2 mesi ad 1 anno (...)</p>
        <p>When the same dosage can be applied in more than one case, span duplicates may be present (e.g., 2 capsule una volta al giorno). In the final GS, these are removed so that only one span for each type is kept.</p>
        <p>Some drugs must be administered according to a schedule that spans different time periods, with or without dosage variations. In such cases, we annotate only the initial recommended dosage.</p>
        <p>In some cases, the posology section does not provide specific dosage information and instead includes a general recommendation to consult a doctor. In these instances, we consider the information to be missing and do not annotate the general statement.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Drug interaction</title>
        <p>As for drug interactions, we annotate the name of molecules and drugs when they are available. In some cases, the information about drug interaction is reported as a general reference to the use of some drugs (e.g., medicinali per abbassare la pressione - medicines to lower blood pressure). In such instances, as we cannot identify the specific molecule or drug, we annotate the general reference. Information about drug interactions may also appear as a reference to certain types of relationships with other molecules, as in derivati della fenotiazina (phenothiazine derivatives). For our annotations, we omit additional information and select the minimal span, in the aforementioned example, fenotiazina (phenothiazine).</p>
        <p>Similarly, when the information pertains to the drug class instead of reporting the molecule, e.g., lassativi (laxatives), we annotate the minimal span, even though in some cases the drug use is specified, e.g., medicinali usati per trattare la stipsi (medicines used to treat constipation).</p>
        <p>We apply a hierarchical priority to identify and annotate the minimal span that conveys the information about drug interactions, as follows:
1. Molecule
2. Drug class
3. Drug use</p>
        <p>The aforementioned hierarchy helps us identify the span to be annotated. When included, drug names are always annotated.</p>
        <p>When the interaction information is reported with the specific pharmaceutical form (e.g., eritromicina iniettabile), only the minimal possible span is annotated, i.e., eritromicina.</p>
        <p>In some cases, examples of interacting molecules or drug names are provided alongside the drug class or use (e.g., medicinali usati per il trattamento dell'HIV AIDS, per esempio ketoconazolo e itraconazolo - medicines used for the treatment of HIV/AIDS, for example, ketoconazole and itraconazole). In these instances, we annotate both, as the list of drugs and molecules may not be exhaustive. If the list is exhaustive, we do not annotate the general reference to the drug use; we only annotate the drug molecules or names.</p>
        <p>Interactions with some other molecules can be conditioned by the amount taken, e.g., cimetidina, preso in dosi giornaliere superiori a 800mg (cimetidine, taken in daily doses greater than 800 mg). In these cases too, the molecule name is the only span annotated.</p>
        <p>Some interacting drugs are reported as the general drug class, together with a plain language explanation and a subclass specification, as in the following example:
diuretici (compresse per urinare in particolare quelli chiamati risparmiatori di potassio) (diuretics (tablets for urination, particularly those called potassium-sparing))
As the molecule is not noted, we annotate both the general class and the subclass (both in bold face in the previous excerpt).</p>
        <p>Additionally, food and beverages can also interact with drugs, e.g., pompelmo, alcol (grapefruit, alcohol). We opt not to include these substances within the drug interaction class, as we want to focus only on pharmaceutical drug interactions.</p>
        <p>Drug interaction information is considered missing when there is only a general statement that the use of any other drug should be reported.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Usage</title>
        <p>With respect to usage, we consider the minimal possible span which indicates the disease treated by the specific drug. Thus, for instance, in the sentence {drug_name} è usato nel trattamento della gotta ({drug_name} is used in the treatment of gout), we annotate only gotta (gout).</p>
        <p>In other cases, some examples of usage may be reported, as in traumi (ad esempio causati dallo sport) (injuries (for example, those caused by sports)). As those cases are not representative enough of usages, we do not include them in the annotation, so in the previous excerpt we annotate just traumi (injuries).</p>
        <p>Within the usage section, a plain-language description is sometimes reported together with the reference to the specific disease, e.g., meningite criptococcica - un'infezione micotica del cervello (...). We always annotate the specific term for the disease and discard the plain-language description.</p>
        <p>When the generic disease class is presented, e.g., infezioni cutanee (skin infections), followed by a non-exhaustive list of examples, we annotate just the generic use.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Side effects</title>
        <p>This class indicates all the possible side effects caused by the drug consumption. In PILs, this type of information is generally grouped on the basis of the number of people affected by the side effects to identify different diffusion levels, e.g., very common side effects, very rare side effects. We do not differentiate among the diffusion levels and consider all the side effects as belonging to the same class side_effect. In some cases, side effects affecting subjects other than the person consuming the drug are reported. For instance, some drugs can affect the fetus, as in the following excerpt.</p>
        <p>Example:
(...) Se assume Ricap durante le ultime fasi della gravidanza, il suo bambino potrebbe manifestare i seguenti sintomi: problemi a respirare, colorito bluastro o violaceo della pelle, convulsioni (...).
[(...) If you take Ricap during the later stages of pregnancy, your baby may experience the following symptoms: breathing problems, bluish or purplish skin discoloration, seizures (...)]</p>
        <p>We do not annotate these secondary side effects or the ones derived from drug overdose.</p>
        <p>When the side effect type is reported together with its symptoms, we include those within the class of side effects. For instance, in some cases a list of symptoms, dificoltà respiratoria, riduzione della pressione sanguigna, is combined with the general side effect reazioni allergiche. Each of them is annotated separately and included in the list of side effects.</p>
        <p>Similarly, we annotate both the plain language side effect and the term, as in problemi del flusso della bile (colestasi) (bile flow problems (cholestasis)).</p>
        <p>When the side effects are reported as the worsening of an already existing disease, e.g., aumentata perdita di capelli, we annotate the minimum possible span, i.e., perdita di capelli.</p>
        <p>For drugs containing more than one molecule, side effects are reported along with the side effects for each individual molecule. We annotate all of them.</p>
        <p>Side effects can be reported with reference to some patient/disease type, e.g., Se è HIV positivo può mostrare effetti indesiderati (If you are HIV positive, you may experience side effects). In such cases, symptoms are annotated without any further specification.</p>
        <p>If duplicates are present, they are either not annotated or removed in the post-processing phase, so that just one entry per symptom type is recorded in the GS.</p>
        <p>Sometimes, side effects are grouped by indicating the general area (e.g., organ or functionality) affected, e.g., nervous system disorders. The information might be followed by a list of specific side effects. When this is the case, we discard the general information in favor of the most specific one.</p>
        <p>It is worth stressing that other information may be presented in PILs, for instance Precautions for use. As we are not interested in this type of information, we do not annotate such sections.</p>
      </sec>
      <sec id="sec-4-7">
        <title>Inter-Annotator Agreement</title>
        <p>The annotation has been performed by three people with computational linguistic backgrounds and different levels of expertise. An initial inter-annotator agreement has been evaluated after the first draft of the guidelines was created. Borderline cases and issues have been collected by each of the annotators and subsequently discussed and solved. The guidelines have been updated accordingly and a second round of annotation has been performed in order to compute the final inter-annotator agreement.</p>
        <p>The annotation round for evaluating the final inter-annotator agreement has been performed on a subset of 60 leaflets.</p>
        <p>The results, calculated before the post-processing phase, show complete agreement on the molecule class among all the annotators, while for the remaining classes the agreement ranges from .61 for posology to .80 for side effects (Table 3).</p>
        <table-wrap id="tab-3">
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>Class</th><th>A1/A2</th><th>A1/A3</th><th>A2/A3</th><th>AVG</th></tr>
            </thead>
            <tbody>
              <tr><td>Molecule</td><td>1</td><td>1</td><td>1</td><td>1</td></tr>
              <tr><td>Usage</td><td>.69</td><td>.67</td><td>.68</td><td>.68</td></tr>
              <tr><td>Posology</td><td>.61</td><td>.62</td><td>.66</td><td>.63</td></tr>
              <tr><td>Drug interaction</td><td>.66</td><td>.66</td><td>.65</td><td>.66</td></tr>
              <tr><td>Side effects</td><td>.80</td><td>.76</td><td>.75</td><td>.78</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>
          To assess the inter-annotator agreement (IAA) for the
creation of the gold standard, we employed two diferent
metrics: pairwise F1 score [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ] and token-level
agreement percentage [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The pairwise F1 score was used to
calculate the IAA for the "Molecule" and "Usage" labels,
as the information contained in the text for these entities
refers to unique and well-defined concepts. This metric
provides a balanced measure of the precision and recall
of the annotations, allowing us to quantify the level of
agreement between annotators on the identification of di calcio nel sangue and ipercalcemia.
these specific entities. This choice aims at accounting for both entities as
possi
        </p>
        <p>On the other hand, for the "Dosage", "Drug Interaction", ble correct answers.
and "Side Efect" classes, we opted to use the token-level For instance, for the drug NATRILIX, the expected
reagreement percentage as the IAA metric. This choice sults are as it follows:
was motivated by the fact that these classes involve
variable text spans, which can be more challenging to align • Usage: pressione sanguigna elevata, ipertensione
between annotators. Before calculating the token-level arteriosa essenziale
agreement percentage, we performed preprocessing steps
on the annotated portions, removing punctuation marks
(such as - and • that indicate a list) and Italian stopwords
from the spaCy Italian language model (footnote 6). The token-level
agreement percentage provides a more granular assessment
of the consistency in the identification of the relevant
text segments, which is crucial for the accurate
extraction of these types of entities from the source
documents.</p>
        <p>GS Post-processing. To ensure high consistency
among annotations and to remove additional information
that does not meet the specified annotation criteria,
we perform a post-processing step. During this phase,
we review the GS, using recurring patterns and regular
expressions to clean the data and correct errors. We also
carry out manual cleaning to produce the final GS.</p>
        <p>For instance, when applicable, we remove the drug name
mentioned in the posology specification (e.g., one tablet
of drug_name once a day) so that only the general information
related to the molecule is retained. The resulting evaluation
dataset contains XXX annotated molecules, XXX drug interactions,
XXX usage information, and XXX side effects (Table 4).</p>
        <table-wrap>
          <label>Table 4</label>
          <table>
            <thead>
              <tr><th>Class</th><th>Tot. Entities</th><th>Unique Entities</th></tr>
            </thead>
            <tbody>
              <tr><td>Molecule</td><td>657</td><td>657</td></tr>
              <tr><td>Usage</td><td>2159</td><td>2113</td></tr>
              <tr><td>Posology</td><td>831</td><td>827</td></tr>
              <tr><td>Side effect</td><td>36748</td><td>30313</td></tr>
              <tr><td>Drug interaction</td><td>8617</td><td>8458</td></tr>
              <tr><td>Total</td><td>49012</td><td>42368</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>• Molecule: indapamide
• Dosage: 1 compressa al giorno
• Side_effect: eruzioni cutanee, bassi livelli di potassio nel sangue, vomito, porpora ...
• Drug_interaction: litio, chinidina, idrochinidina, disopiramide (...)</p>
        <p>For the drug Trevid, the correct answers would be:</p>
        <p>• Usage: carenza di vitamina D
• Molecule: colecalciferolo
• Dosage: 3-4 gocce al giorno
• Side_effect: livelli aumentati di calcio nel sangue, ipercalcemia, livelli aumentati di calcio nelle urine, ipercalciuria, debolezza, astenia, reazioni allergiche, appetito ridotto (...)
• Drug_interaction: anticonvulsivanti, barbiturici, colestipolo, colestiramina, orlistat (...)</p>
        <p>The expected results should be presented as a list of
entities for each of the classes of information about each
drug. To obtain the result lists, we consider the annotated
terms and their simplifications as distinct entities, e.g., the
span livelli aumentati di calcio nel sangue (ipercalcemia)
(elevated levels of calcium in the blood (hypercalcemia))
is listed as two separate entities, namely livelli aumentati
di calcio nel sangue and ipercalcemia.</p>
        <p>Since this is an information extraction task in a zero-shot
setting based on PILs, it is expected that LLMs will be
able to extract the exact terminology used in the different
sections of the PILs and provide a list of terms. The
performance will be evaluated based on the metrics
described in Section 4. Potential limitations in accurately assessing
the performance of LLMs may arise from: 1) the variability
in the models' choice of terms to extract, and 2)
the provision of terms and their simplifications as two
entities. In the first case, forcing the LLMs to provide a
more structured and less ambiguous output might help,
since the gold standard currently does not account for a
set of synonyms to handle variability in the output; the
second case might be addressed by employing additional metrics.</p>
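        <p>The preprocessing and post-processing steps described in this section can be sketched roughly as follows. This is a minimal, illustrative Python sketch and not the authors' actual pipeline: the function names, the tiny stopword set (a stand-in for the spaCy Italian stopword list), the regular expressions, and the sample posology string used in the usage note are all assumptions introduced for the example.</p>
        <preformat>
```python
import re
import string

# Stand-in stopword list (assumption): the paper takes Italian stopwords from a
# spaCy model; a tiny hand-picked subset keeps this sketch dependency-free.
ITALIAN_STOPWORDS = {"di", "al", "nel", "nelle", "e", "il", "la", "un", "una"}

# Punctuation to strip, including the bullet used as a list marker.
_PUNCT = string.punctuation + "•"
_PUNCT_TABLE = str.maketrans(_PUNCT, " " * len(_PUNCT))


def tokens_for_agreement(span: str) -> list[str]:
    """Lowercase a span, replace punctuation with spaces, drop stopwords."""
    cleaned = span.lower().translate(_PUNCT_TABLE)
    return [tok for tok in cleaned.split() if tok not in ITALIAN_STOPWORDS]


def strip_drug_name(posology: str, drug_name: str) -> str:
    """Remove a 'di DRUG_NAME' mention so only the general posology remains."""
    pattern = re.compile(r"\s+di\s+" + re.escape(drug_name) + r"\b", re.IGNORECASE)
    without_name = pattern.sub("", posology)
    # Collapse any doubled whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", without_name).strip()


def split_simplification(span: str) -> list[str]:
    """Split 'term (simplification)' into two entities; keep other spans whole."""
    match = re.fullmatch(r"\s*(.+?)\s*\(([^()]+)\)\s*", span)
    if match is None:
        return [span.strip()]
    return [match.group(1), match.group(2).strip()]
```
        </preformat>
        <p>Under these assumptions, strip_drug_name("3-4 gocce di Trevid al giorno", "Trevid") reduces a hypothetical leaflet phrase to the general posology 3-4 gocce al giorno, and split_simplification("livelli aumentati di calcio nel sangue (ipercalcemia)") yields the two entities livelli aumentati di calcio nel sangue and ipercalcemia, mirroring how the GS lists them.</p>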
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <sec id="sec-5-1">
        <p>One important limitation of the DIMMI dataset is the disclaimer
provided by the Italian Medicines Agency (AIFA)
regarding the content available on their website in section
A. Disclaimer (footnote 7). AIFA states that all the information and
services offered on their website are provided "as is" and
"with all faults". The Italian Medicines Agency, therefore,
does not provide any kind of warranty, either explicit or
implied, regarding the content, including, without
limitation, the legality, ownership, suitability, or fitness for
particular purposes or uses.</p>
      </sec>
      <sec id="sec-5-2">
        <p>Footnote 6: https://spacy.io/models/it#it_core_news_lg</p>
      </sec>
      <sec id="sec-5-3">
        <p>Footnote 7: https://www.aifa.gov.it/en/copyright</p>
        <p>This disclaimer from the data source raises concerns
about the reliability and quality of the patient information
leaflets (PILs) that were used to construct the DIMMI
corpus. While the dataset has been carefully curated
and annotated, the underlying data may contain errors,
inaccuracies, or other issues that are not explicitly
acknowledged by the original provider. Researchers and
developers using the DIMMI dataset should be aware of
this limitation and exercise caution when relying on the
information contained within the corpus, particularly for
critical applications or decision-making processes.</p>
        <p>By making the DIMMI corpus available under the
CC-BY 4.0 license, the dataset can be freely accessed, utilized,
and built upon by the scientific community, contributing
to the advancement of research and applications in
the field of biomedical text mining and pharmacological
information extraction.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Acknowledgments</title>
        <p>Luca Giordano has been supported by Borsa di Studio
GARR "Orio Carlini" 2023/24 - Consortium GARR, the
National Research and Education Network.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Ethical issues</title>
      <sec id="sec-6-1">
        <p>Ethical considerations are crucial when working with
a dataset that contains sensitive information from PILs.
The DIMMI corpus, which is derived from the AIFA
(Italian Medicines Agency) Database, must be handled with
the utmost care and respect for individual privacy, data
protection, and the diversity of the target population.</p>
        <p>Additionally, the use of the DIMMI corpus for the
development and evaluation of natural language processing
models must be guided by ethical principles that consider
the diversity of the target population. The models trained
on this data should be designed and deployed in a way
that respects individual privacy, avoids potential
misuse or discrimination, and ultimately benefits the public
good, regardless of ethnicity or age. Careful
consideration should be given to the potential societal impact of
the applications built upon the DIMMI dataset, ensuring
that they are inclusive and equitable.</p>
        <p>By upholding the ethical standards in the handling and
utilization of the DIMMI corpus, the research community
can ensure that the valuable pharmacological information
contained in the PILs is leveraged responsibly and in a
manner that prioritizes the well-being of patients and
the general public, while respecting the diversity of the
target population.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Data license and copyright issues</title>
      <sec id="sec-7-1">
        <p>The DIMMI corpus has been created using the patient
information leaflets (PILs) from the AIFA (Italian Medicines
Agency) Database. As reported on the website (footnote 8), the
distribution license used by AIFA for these data is the Creative
Commons Attribution (CC-BY) license, version 4.0. This
license allows third parties to distribute, modify, adapt,
and use the data, even for commercial purposes, with the
sole requirement of providing attribution to the original
source.</p>
      </sec>
      <sec id="sec-7-2">
        <p>Footnote 8: https://www.aifa.gov.it/en/copyright</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Shrank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Avorn</surname>
          </string-name>
          ,
          <article-title>Educating patients about their medications: the potential and limitations of written drug information</article-title>
          ,
          <source>Health affairs 26</source>
          (
          <year>2007</year>
          )
          <fpage>731</fpage>
          -
          <lpage>740</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Azarola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lorda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cantalejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Danet</surname>
          </string-name>
          , et al.,
          <article-title>Quality improvement of health information included in drug information leaflets. patient and health professional expectations</article-title>
          ,
          <source>Atencion primaria 42</source>
          (
          <year>2009</year>
          )
          <fpage>22</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. Á.</given-names>
            <surname>Piñero-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Modamio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Lastra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Mariño</surname>
          </string-name>
          ,
          <article-title>Readability analysis of the package leaflets for biological medicines available on the internet between 2007 and 2013: an analytical longitudinal study</article-title>
          ,
          <source>Journal of medical Internet research 18</source>
          (
          <year>2016</year>
          )
          <fpage>e100</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Segura-Bedmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>Simplifying drug package leaflets written in spanish by using word embedding</article-title>
          ,
          <source>Journal of biomedical semantics 8</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>Large language models illuminate a progressive pathway to artificial healthcare assistant: A review</article-title>
          ,
          <source>arXiv preprint arXiv:2311.01918</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agichtein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pinter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <article-title>Overview of the medical question answering task at TREC 2017 LiveQA</article-title>
          ,
          <source>in: TREC</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mrabet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Goodwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Shooshan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <article-title>Bridging the gap between consumers' medication questions and trusted answers</article-title>
          ,
          <source>in: MEDINFO 2019: Health and Wellbeing e-Networks for All</source>
          , IOS Press,
          <year>2019</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rybinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Medredqa for medical consumer question answering: Dataset, tasks, and neural baselines</article-title>
          ,
          <source>in: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>629</fpage>
          -
          <lpage>648</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Filannino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ö.</given-names>
            <surname>Uzuner</surname>
          </string-name>
          ,
          <article-title>Advancing the state of the art in clinical natural language processing through shared tasks</article-title>
          ,
          <source>Yearbook of medical informatics 27</source>
          (
          <year>2018</year>
          )
          <fpage>184</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vaishya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaish</surname>
          </string-name>
          ,
          <article-title>ChatGPT: Is this version good for healthcare and research?</article-title>
          ,
          <source>Diabetes &amp; Metabolic Syndrome: Clinical Research &amp; Reviews</source>
          <volume>17</volume>
          (
          <year>2023</year>
          )
          <fpage>102744</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Petro</surname>
          </string-name>
          ,
          <article-title>Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine</article-title>
          ,
          <source>New England Journal of Medicine</source>
          <volume>388</volume>
          (
          <year>2023</year>
          )
          <fpage>1233</fpage>
          -
          <lpage>1239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gilbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Harvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Melvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vollebregt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wicks</surname>
          </string-name>
          ,
          <article-title>Large language model ai chatbots require approval as medical devices</article-title>
          ,
          <source>Nature Medicine</source>
          <volume>29</volume>
          (
          <year>2023</year>
          )
          <fpage>2396</fpage>
          -
          <lpage>2398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Musacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Giordano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Di Buono</surname>
          </string-name>
          ,
          <article-title>Large language models as drug information providers for patients</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hripcsak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rothschild</surname>
          </string-name>
          ,
          <article-title>Agreement, the F-measure, and reliability in information retrieval</article-title>
          ,
          <source>Journal of the American medical informatics association 12</source>
          (
          <year>2005</year>
          )
          <fpage>296</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Deleger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lingren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Molnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stoutenborough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kouril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Marsolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Solti</surname>
          </string-name>
          , et al.,
          <article-title>Building gold standard corpora for medical natural language processing tasks</article-title>
          ,
          <source>in: AMIA Annual Symposium Proceedings</source>
          , volume
          <volume>2012</volume>
          , American Medical Informatics Association,
          <year>2012</year>
          , p.
          <fpage>144</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Grouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Fort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          ,
          <article-title>Proposal for an extension of traditional named entities: From guidelines to evaluation, an overview</article-title>
          ,
          <source>in: Proceedings of the 5th linguistic annotation workshop</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>