<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLinkaRT at EVALITA 2023: Overview of the Task on Linking a Lab Result to its Test Event in the Clinical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Begoña Altuna</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goutham Karunakaran</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Lavelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuela Speranza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Zanoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, 38123 Povo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
          ,
          <addr-line>Manuel Lardizabal 1, 20018 Donostia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>CLinkaRT at EVALITA 2023 is a relation extraction task based on clinical cases taken from the E3C corpus, i.e. Italian written documents reporting statements of a clinical practice. The task consists in identifying clinical results and measures and linking them to the laboratory tests and measurements from which they were obtained. Three teams participated in the task and various supervised machine learning models, both traditional and based on deep learning, were evaluated. In this evaluation, the deep learning models outperformed the traditional ones. Interestingly, none of the teams explored the use of few-shot language modeling. However, the fact that the supervised models significantly outperformed the task baselines implementing few-shot learning shows the crucial role still played by the availability of annotated training data.</p>
      </abstract>
      <kwd-group>
        <kwd>Relation Extraction</kwd>
        <kwd>Clinical NLP</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Supervised Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>
        tus at a certain time of the development of a disorder and
are crucial to choose the right diagnosis. From a more
There is a growing interest in processing clinical data for technical point of view, processing laboratory tests and
tasks of public interest, such as clinical decision making their results also brings up a new perspective on the
treat[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or monitoring of the health status of a country [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. ment of data, since it requires interpreting numeric values
While for this purpose large amounts of structured data and ranges and therefore can not be handled as a
comare needed, the reality is that most clinical data are stored mon named entity recognition task [9]. In this context,
as free unstructured clinical texts. Hence, the ability of the CLinkaRT task (LINKing A Result to its Test in the
extracting information directly from natural language texts CLINnical domain) in EVALITA 2023 [10] provides an
and to increase the volume of databases and structured opportunity to evaluate different Natural Language
Prodatasets, such as MIMIC-III [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], is crucial. cessing approaches and does this with a focus on Italian,
      </p>
      <p>
        Having these goals into account, scholars have devel- a less explored language than English.
oped a series of resources for information extraction from
clinical texts. Clinical information extraction efforts have
often given priority to the identification of diseases [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] 2. Task Description
or events [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As far as the extraction of relations from
clinical texts is concerned, previous work has focused The CLinkaRT task consists in identifying textual
menon concept normalization [6] and temporal relations [7], tions of both laboratory tests and measurements in a
clinamong others. Laboratory tests and measurements and ical narrative, and then linking these to their respective
their results have been given little attention [8], although results. Clinical narratives (or clinical cases) are
docthey provide interesting information on the patients’ sta- uments reporting statements of a clinical practice,
presenting the reason for a clinical visit, the description of
EVALITA 2023: 8th Evaluation Campaign of Natural Language Pro- physical exams, the assessment of the patient’s situation
*ceCsosrinregspaonnddSinpgeeacuhthTooro.ls for Italian, Sep 7 – 8, Parma, IT aLnadbotrhaetodriya gtensotssiasn,dasmweaeslluraesmtehnetsfoalrleowcoinmgmtroenaltymdeonntse.
†$Thbeesgeoanuat.halotrusncao@netrhibuu.etueds (eBq.uaAlllytu.na); as part of this process and are typically documented in
goutham.karunakaran@studenti.unitn.it (G. Karunakaran); clinical narratives.
lavelli@fbk.eu (A. Lavelli); magnini@fbk.eu (B. Magnini); Figure 1 presents an excerpt of a clinical case where
labmanspera@fbk.eu (M. Speranza); zanoli@fbk.eu (R. Zanoli) oratory tests have been marked in bold1 and their results
0000-0002-4027-2014 (B. Altuna)
      </p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License in italics.</p>
      <p>CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutiRon 4W.0Iontrekrnsathioonapl(CPCrBoYc4e.0e)d.ings (CEUR-WS.org) 1Note that the head of the mention is capitalized.</p>
      <sec id="sec-1-1">
        <title>Figure 1: Excerpt of a clinical case</title>
        <p>Osvaldo, anni 52, ha una storia di diarrea e calo ponderale che si può far riferire a due anni prima. Non c'è storia di sanguinamento gastroenterico ed una RICERCA di sangue occulto fecale è risultata negativa su tre campioni. Ammette di averci dato dentro con l'alcol in passato, ma da diversi anni è assolutamente astinente. Ha un diabete, controllato con insulina. Sei anni prima è stato colecistectomizzato. Gli ESAMI di laboratorio sono normali, se si fa eccezione per una lieve anemia, così come normali sono lo STUDIO radiologico del piccolo e del grosso intestino.</p>
        <sec id="sec-1-1-1">
          <title>Pertains-To relations in the example</title>
          <p>In this example, we have the following Pertains-To relations that participants needed to identify between results and tests:
• negative / negativa -&gt; the research for fecal blood / una RICERCA di sangue occulto fecale
• normal / normali -&gt; the laboratory exams / gli ESAMI di laboratorio
• normal / normali -&gt; the radiological study of the bowels / lo STUDIO radiologico del piccolo e del grosso intestino</p>
          <p>3. Dataset</p>
          <p>The CLinkaRT task is based on the Italian part of E3C, the multilingual European Clinical Case Corpus [11], a collection of clinical cases derived from different sources, such as published articles available from PubMed, and existing corpora. As such, the dataset encompasses a variety of clinical disciplines in different hospitals and a wide range of laboratory tests.</p>
          <p>One of the three sections which make up the E3C corpus has been manually annotated with different types of information, such as:
• events (which include laboratory tests, among others), temporal expressions and temporal relations, annotated according to THYME [12], an adaptation of the TimeML framework [13];
• results of laboratory tests and measurements, marked through the RML tag (defined within the E3C project), and Pertains-To relations holding between an RML and the event it refers to;
• clinical entities (in particular diseases, syndromes, findings, signs, symptoms, etc.) listed in medical taxonomies, which is useful for tasks focusing on clinical entity recognition and analysis [14].
More specifically, the CLinkaRT task is based on two sets of data, a training set and a test set.</p>
          <p>3.1. Annotation</p>
          <p>Among all the annotations foreseen by the E3C project, the data used for CLinkaRT contain the following annotations:
• Laboratory test and measurement EVENTS: they include medical procedures in which parts of the body or bodily substances (blood, urine, etc.) are analyzed, as well as different acts of measuring, such as measuring patients' physical features (e.g. height and weight) or the size of a lesion or mass;
• RMLs: the results of lab tests and measurements; they can consist of a text string (e.g. normal / normali) but more often contain numerical values, typically followed by a unit of measure (e.g. 7,5 g/dl);
• Pertains-To relations connecting an RML (the source) to the relevant EVENT (the target). Pertains-To relations can be one-to-one, one-to-many and many-to-one.</p>
          <p>Both RMLs and test and measurement EVENTS are marked as strings of text; notice, however, that tests and measurements belong to the TimeML category EVENT and are therefore marked by their syntactic head only (i.e. strictly one token only), while RMLs, as defined within the E3C project, are marked by a whole syntactic chunk (one or more tokens).</p>
          <p>In the example below we have two Pertains-To relations between two EVENTS, i.e. a laboratory test (protidemia totale) and a measurement (peso), and their results (RMLs), i.e. 4,5 g/dl and 19 Kg respectively.
Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl / Body weight of 19 Kg, total protidemia 4,5 g/dl
19 Kg -&gt; Peso
4,5 g/dl -&gt; protidemia</p>
          <p>3.2. Inter Annotator Agreement</p>
          <p>All the data used for the task have been manually annotated by expert computational linguists, and inter-annotator agreement has been assessed on ten documents, which have been annotated by two annotators independently. On average, each annotator has identified 111 relations.</p>
          <p>The resulting Dice's coefficient [15] is 0.87, which is quite high given that agreement between annotators is only considered as such when there is a complete overlap in the spans of the source and the target (exact match). The high agreement between annotators ensures that annotations throughout the whole dataset are consistent. More specifically, the inter-annotator agreement is particularly high when numerical values are present in the RMLs (it reaches 0.92 in terms of Dice's coefficient), while it is slightly lower (Dice=0.84) in the case of RMLs without numerical values.</p>
          <p>3.3. Data Distribution Format</p>
          <p>The annotated data have been provided to the participants in a format that is an adaptation of the PubTator format (see an example in Figure 2). It consists of a straightforward tab-delimited text file, where every document in the dataset is in a new line preceded by the DOCID and the |t| marker. A blank line is used as an indicator of the end of the document, followed by the annotated relations: every relation is in a separate line and is represented as an ordered pair, as in (RML -&gt; EVENT), and each string is represented by its start and end character offsets.</p>
          <p>Figure 2 (document line only): 100001|t|Osvaldo, anni 52, ha una storia di diarrea e calo ponderale che si può far riferire a due anni prima. Non c'è storia di sanguinamento gastroenterico ed una RICERCA di sangue occulto fecale è risultata negativa su tre campioni. Ammette di averci dato dentro con l'alcol in passato, ma da diversi anni è assolutamente astinente. Ha un diabete, controllato con insulina. Sei anni prima è stato colecistectomizzato. Gli ESAMI di laboratorio sono normali, se si fa eccezione per una lieve anemia, così come normali sono lo STUDIO radiologico del piccolo e del grosso intestino.</p>
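          <p>As a rough illustration of the distribution format just described, the adapted PubTator files can be read with a few lines of Python. The exact column layout of the relation lines (tab-separated document id and character offsets, and the offset values used in the sample) is our assumption from the prose above, not the official reader.

```python
# Sketch of a reader for the adapted PubTator format described in Section 3.3.
# Document lines look like "DOCID|t|text"; relation lines are assumed to be
# tab-separated: DOCID, RML start, RML end, EVENT start, EVENT end.

def parse_clinkart(lines):
    docs, relations = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if "|t|" in line:                       # document line: DOCID|t|text
            doc_id, text = line.split("|t|", 1)
            docs[doc_id] = text
        elif line.strip():                      # relation line (assumed layout)
            doc_id, rml_start, rml_end, ev_start, ev_end = line.split("\t")[:5]
            text = docs[doc_id]
            relations.append((doc_id,
                              text[int(rml_start):int(rml_end)],   # RML string
                              text[int(ev_start):int(ev_end)]))    # EVENT string
    return docs, relations

sample = [
    "100001|t|Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl",
    "",                         # blank line ends the document text
    "100001\t17\t22\t0\t4",     # 19 Kg -&gt; Peso
    "100001\t42\t50\t24\t34",   # 4,5 g/dl -&gt; protidemia
]
docs, rels = parse_clinkart(sample)
```

Resolving offsets back to surface strings, as above, is also how the ordered pairs (RML -&gt; EVENT) of Figure 2 can be checked against the document text.</p>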
        </sec>
      </sec>
    </sec>
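    <p>The exact-match agreement measure used in Section 3.2 is straightforward to reproduce. A minimal sketch, assuming relations are represented as hashable tuples of document id and character spans (the representation and offsets are illustrative):

```python
# Dice's coefficient between two annotators' relation sets: a relation is
# shared only when both the RML span and the EVENT span match exactly.

def dice(set_a, set_b):
    total = len(set_a) + len(set_b)
    if total == 0:
        return 1.0
    return 2 * len(set_a.intersection(set_b)) / total

# Relations as (doc_id, RML_span, EVENT_span); offsets are illustrative.
ann1 = {("d1", (17, 22), (0, 4)), ("d1", (42, 50), (24, 34))}
ann2 = {("d1", (17, 22), (0, 4)), ("d1", (41, 50), (24, 34))}  # RML span differs

print(dice(ann1, ann2))  # 0.5: only the exact match counts as agreement
```

Because only complete span overlaps count, a single boundary disagreement removes the relation from the intersection entirely, which is why the reported coefficient is lower for non-numerical RMLs with fuzzier boundaries.</p>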
    <sec id="sec-2">
      <title>4. Baselines</title>
      <sec id="sec-2-1">
        <title>Baseline Approaches</title>
        <p>To improve the assessment of participant systems' performance, supervised and unsupervised baselines have been used for comparative analysis. These baselines have been made available through the GitLab repository.2</p>
        <p>The supervised baseline was assessed using two different approaches.</p>
        <p>The first approach is based on vocabulary transfer from training to testing (voc. tran.). In this approach, a system is used to recognize textual references to laboratory tests and measurements present in the test set using the entities found in the training set. Additionally, regular expressions derived from the training data are used to recognize various result entities that pertain to measurements, typically represented by values. To establish relationships between the recognized entities, a relation is created for each pair of laboratory test/measurement and result entities that co-occur within the same sentence.</p>
        <p>The second approach relies on a fine-tuned multilingual BERT model3 trained on textual mentions involved in relations within the training data. The implementation of this model has been carried out using the SimpleTransformer library.4 The model is capable of recognizing both textual references to laboratory tests and measurements and their results.
Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl / Body weight of 19 Kg, total protidemia 4,5 g/dl
19 Kg -&gt; Peso
4,5 g/dl -&gt; protidemia
In the example above the implemented model identifies the following mentions, using the IOB annotation, where test events are represented as TST and results as RML:
Peso [B-TST] corporeo di 19 [B-RML] Kg [I-RML], protidemia [B-TST] totale 4,5 [B-RML] g/dl [I-RML]</p>
        <p>Subsequently, an additional multilingual BERT model (configured similarly to the previous BERT model) was fine-tuned on the annotated relations within the training data to extract the relationships between the recognized laboratory tests and their results in the test data. Concerning the training data, both positive and negative examples were generated for sentences containing at least one laboratory test/measurement and one result entity. For each generated example, the entities in the relationship were marked by adding "[TST]" as both the prefix and suffix to the laboratory tests and measurements, while "[RML]" was used to denote the results. The number of examples generated per sentence was determined by multiplying the number of laboratory tests by the number of result entities present in the sentence.</p>
        <p>For the test data, the examples to be classified were generated following a similar process, with the difference that instead of using the entities from the gold standard we used the predicted entities. In the case of the sentence reported above, the following examples were generated, along with their corresponding model predictions (1=positive, 0=negative):
1 [TST]Peso[TST] corporeo di [RML]19 Kg[RML], protidemia totale 4,5 g/dl
0 [TST]Peso[TST] corporeo di 19 Kg, protidemia totale [RML]4,5 g/dl[RML]
0 Peso corporeo di [RML]19 Kg[RML], [TST]protidemia[TST] totale 4,5 g/dl
1 Peso corporeo di 19 Kg, [TST]protidemia[TST] totale [RML]4,5 g/dl[RML]</p>
        <p>2 https://gitlab.fbk.eu/zanoli/clinkart-baseline.git</p>
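        <p>The candidate-generation scheme described above can be sketched as follows. The marker-insertion helper is our reconstruction of the described [TST]/[RML] wrapping under assumed character-span inputs, not the released baseline code.

```python
# For every (test, result) pair in a sentence, build one classification
# example by wrapping the two mentions in [TST] and [RML] markers, as in
# the supervised baseline description: tests x results examples per sentence.

def make_candidates(sentence, tests, results):
    """tests / results: lists of (start, end) character spans."""
    examples = []
    for t_start, t_end in tests:
        for r_start, r_end in results:
            marked = sentence
            # Insert markers right-to-left so earlier offsets remain valid.
            for start, end, tag in sorted(
                    [(t_start, t_end, "[TST]"), (r_start, r_end, "[RML]")],
                    reverse=True):
                marked = marked[:start] + tag + marked[start:end] + tag + marked[end:]
            examples.append(marked)
    return examples

sent = "Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl"
tests = [(0, 4), (24, 34)]        # Peso, protidemia
results = [(17, 22), (42, 50)]    # 19 Kg, 4,5 g/dl

candidates = make_candidates(sent, tests, results)
print(candidates[0])  # [TST]Peso[TST] corporeo di [RML]19 Kg[RML], protidemia totale 4,5 g/dl
```

With two tests and two results the sketch yields four candidates, matching the tests-times-results count given above; the relation classifier then labels each candidate 1 or 0.</p>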
        <p>The unsupervised baseline uses GPT and OpenAI's API (text-davinci-003). It focuses on one-shot learning, where the model receives a single example during inference through the prompt. This makes one-shot learning more similar to unsupervised learning than to supervised learning. The prompt used for performing this evaluation is: Ho un compito che è quello di estrarre menzioni di test di laboratorio e dei loro risultati da casi clinici. Ecco un esempio di testo e output: docId:100998. Nota: nell'output viene scritto prima il risultato e poi il nome del test. Sono separati da "|". Ora dammi l'output per il seguente testo.5 Within the prompt, docId:100998 represents the annotated document selected from the training dataset as the only example for GPT.</p>
        <p>5 My task is to extract laboratory test mentions and their results from clinical cases. Here you have an example of a text and its output: docId:100998. Note: in the output the result is written first and then the name of the test. They are separated by "|". Now give me the output for the following text.</p>
        <p>5. System Descriptions</p>
        <p>Eight teams expressed their interest in participating in the task. Eventually, four teams submitted their annotated data, resulting in a total of six runs. After the evaluation phase, one team decided to withdraw, so we now present the results of four runs submitted by three different teams. Participants explored various (supervised) approaches, including traditional machine learning methods, as well as BERT [16] and its derivative models, and top Large Language Models (LLMs) such as LLaMA [17]. A brief overview of each team's approach is reported below, while the corresponding results are reported in Table 1.</p>
        <p>Simple Ideas: Unlike conventional methods that extract entities and relations separately, the proposed approach uses a pipeline in which EVENTS are identified first and the Pertains-To relations are then created from those. Several BERT-based models were assessed, including Italian BERT [18] and DistilBERT [19], which were pre-trained on general topics. Additionally, BioBIT and MedBIT-R3-plus [20] were evaluated as they were specifically pre-trained for the medical field. Among these models, MedBIT-R3-plus resulted as the best model. To optimize their performance, the models were fine-tuned on an augmented version of the original dataset. This augmentation involved the addition of new sentences derived from the original ones, wherein random words were substituted with similar words in the embedding space. This approach achieved the best results in the task and it also obtained the highest ranking in the parallel TESTLINK task at IberLEF 2023 [21]. The availability of the implemented code contributes to the reproducibility of the presented results.</p>
        <p>ExtremITA: The team employed a unified neural model to address all the EVALITA 2023 tasks. To achieve this, they experimented with two different approaches. One approach involved fine-tuning an encoder-decoder model, specifically T5 [22] pre-trained on Italian texts. The second approach is an instruction-tuned decoder-only model based on the LLaMA [17] foundational models. This model was initially trained on Italian translations of Alpaca [23] instruction data. In both cases, the models were fine-tuned by using the complete set of datasets provided by the EVALITA 2023 tasks. Moreover, the CLinkaRT dataset was expanded with annotated documents derived from the Spanish dataset made available in the TESTLINK task. The model built upon the LLaMA model showed strong performance across multiple tasks at EVALITA 2023, including the CLinkaRT task, where it ranked second. The implemented code has been made available.</p>
        <p>
          Polimi: The team used a traditional pipeline-based approach for relation extraction. The first module focused on recognizing entities related to laboratory tests and their corresponding measurements. The module was implemented using two diverse models: CRF [24] and UmBERTo [
          <xref ref-type="bibr" rid="ref6">25</xref>
          ]. For training the CRF, a range of lexical features were used, along with external sources of knowledge like UMLS [
          <xref ref-type="bibr" rid="ref7">26</xref>
          ]. Subsequently, the second module aimed at establishing relationships between exams and results by pairing them based on proximity within the same sentence. While the CRF method obtained quite satisfactory results, tokenization issues prevented any results from being obtained using UmBERTo.
        </p>
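        <p>Returning to the unsupervised baseline, the one-shot prompt can be assembled from the instruction and a single training example. The helper names and the example serialisation (one result|test pair per line) are assumptions based on the prompt description, not the released baseline; text-davinci-003 has since been retired by OpenAI.

```python
# Sketch of one-shot prompt assembly for the GPT baseline. The serialisation
# of the in-context example (result first, then test, separated by "|") is
# assumed from the prompt's own instructions.

INSTRUCTION = (
    "Ho un compito che è quello di estrarre menzioni di test di laboratorio "
    "e dei loro risultati da casi clinici. Ecco un esempio di testo e output:"
)

def build_prompt(example_text, example_relations, target_text):
    # One line per relation: result first, then the test, separated by "|".
    example_output = "\n".join(f"{rml}|{event}" for rml, event in example_relations)
    return (
        f"{INSTRUCTION}\n{example_text}\n{example_output}\n"
        "Nota: nell'output viene scritto prima il risultato e poi il nome del "
        "test. Sono separati da \"|\". Ora dammi l'output per il seguente testo.\n"
        f"{target_text}"
    )

prompt = build_prompt(
    "Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl",
    [("19 Kg", "Peso"), ("4,5 g/dl", "protidemia")],
    "Gli esami di laboratorio sono normali.",
)
```

The single in-context example is what makes this a one-shot rather than a zero-shot setting: no model weights are updated, only the prompt carries task information.</p>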
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Results</title>
      <sec id="sec-3-1">
        <title>Results of the submitted runs (Table 1)</title>
        <p>Table 1, reporting the scores of the runs Simple Ideas-BERT, ExtremITA-LLaMA, Polimi-CRF and ExtremITA-T5, was lost in extraction; only the fragment "ExtremITA-T5 n-ary 37.50 30.77" survives.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
      <sec id="sec-4-1">
        <title>Per-team results (Tables 3-5)</title>
        <p>The per-team breakdowns (Simple Ideas-BERT, ExtremITA-LLaMA, Polimi-CRF, ExtremITA-T5) in Tables 3, 4 and 5 were lost in extraction.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Discussion of the results</title>
        <sec id="sec-4-5-1">
          <title>Overall observations</title>
          <p>Both traditional machine learning and more recent deep learning models were tested for relation extraction. It is worth noting that all participating systems were based on supervised approaches. Additionally, every system outperformed the vocabulary transfer baseline, which represents the threshold below which systems are not expected to perform.</p>
          <p>Table 2: Precision (Pr), Recall (Re) and F1 measures obtained by the supervised (S) and unsupervised (U) baselines.
Baseline    Type  Pr     Re     F1
mBERT       S     61.37  64.37  62.83
GPT         U     29.55  48.73  36.79
voc. tran.  S     29.95  31.86  30.88</p>
          <p>
            Surprisingly, none of the teams attempted to evaluate few-shot learning with LLMs such as GPT [
            <xref ref-type="bibr" rid="ref8">27</xref>
            ] or LLaMA [17]. ExtremITA did evaluate LLaMA, but instead of employing few-shot learning, they opted for a fine-tuning approach, refining the model using the available training data.
          </p>
          <p>We additionally evaluated systems' performance (in terms of F1 measure) along two different dimensions. Table 3 shows the results distinguishing two categories of relations, i.e. n-ary relations (one-to-many and many-to-one) and one-to-one relations. Table 4 presents separate results for relations involving numerical RMLs and non-numerical RMLs. Finally, Table 5 reports the accuracy of participant systems in the recognition of RMLs and EVENTs, i.e. the sources and targets of the relations.</p>
          <p>The assessment of the GPT-based baseline highlights the present understanding that few-shot learning cannot be considered a viable alternative to fine-tuning in the context of the present task. Fine-tuning, although requiring annotated data, produces significantly better results.</p>
          <p>Despite using different pre-trained models trained on diverse domain-specific data (generic domain vs medical domain), the top-performing team (Simple Ideas), along with the second-placed team (ExtremITA) and the baseline model based on multilingual BERT (mBERT), achieved remarkably similar results.</p>
          <p>6 https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/</p>
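          <p>The baseline scores in Table 2 rest on the exact-match criterion described in Section 3.2. A minimal sketch of such a scorer, with the relation representation and offsets chosen for illustration:

```python
# Exact-match scoring of Pertains-To relations: a prediction is a true
# positive only if both its RML span and its EVENT span coincide with gold.

def precision_recall_f1(gold, predicted):
    tp = len(gold.intersection(predicted))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Relations as (doc_id, RML_span, EVENT_span); offsets are illustrative.
gold = {("d1", (17, 22), (0, 4)), ("d1", (42, 50), (24, 34))}
pred = {("d1", (17, 22), (0, 4)), ("d1", (42, 50), (35, 41))}  # wrong EVENT span

p, r, f1 = precision_recall_f1(gold, pred)
print(p, r, f1)  # 0.5 0.5 0.5
```

Under this criterion a high-precision, low-recall system such as the CRF run is penalised heavily in F1, which is the pattern discussed below.</p>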
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusions</title>
      <p>CRF (Polimi), as the only traditional machine learning algorithm involved in the task, obtained a precision (70.34) in line with that of the top models (71.10). Nevertheless, its relatively lower recall (27.12), in comparison to the recall of the best-performing models (60.62), results in moderately satisfactory outcomes in terms of F1 score (39.15).</p>
      <p>One team (Simple Ideas) conducted an evaluation of their pipeline-based approach in two distinct tasks: the Italian CLinkaRT task (F1 62.99) and the parallel TESTLINK task at IberLEF 2023, focusing on Basque (F1 72.65) and Spanish (F1 68.38). Interestingly, this approach demonstrated superior performance across all three languages.</p>
      <p>Based on the outcome of our analysis of systems' performance in relation to two distinct dimensions, i.e. n-ary and one-to-one relations on the one hand, and numerical and non-numerical RMLs on the other (see Tables 3 and 4), we can observe that extracting n-ary relations is more challenging than extracting one-to-one relations, which is not surprising. Moreover, the task of extracting relations involving numerical RMLs seems easier than extracting relations involving non-numerical entities, which may be correlated to the lower agreement obtained on the latter in the IAA test.</p>
      <p>An analysis of the entities involved in the relations extracted by the participants' systems shows that recognising EVENTs seems to be generally harder than recognising RMLs (Table 5). One possible explanation for this is that EVENTs are commonly identified by their syntactic head (leaving out the other elements in the phrase), which can sometimes be quite challenging.</p>
      <p>Participants report two key reasons for the incorrect tagging produced by their models. On the one hand, BERT tokenizers struggle to split medical terms correctly (e.g. antitrombina -&gt; anti trombina), which leads to wrongly setting the boundaries of the annotations. In addition, the difficulty of capturing the most peripheral elements in the entity mentions has also been a cause of failures to detect the entity spans correctly. This is the case of "punte di [circa 1200 pg/ml]" or "pari a 0 o [inferiori a 1.5 mg/dl]", in which only the tokens between the brackets have been annotated by the systems.</p>
      <p>The results obtained did not allow us to determine whether the task being examined is inherently more difficult in one language compared to other languages due to language-specific traits. Within this framework, the vocabulary transfer baseline, which is expected to provide a preliminary indication of the task's difficulty, achieves better results on the Italian CLinkaRT task (F1 30.88) compared to the parallel TESTLINK task for Basque (F1 23.96) and Spanish (F1 22.10). However, the participating systems, such as the Simple Ideas system, showed contrasting results.</p>
      <p>Extracting laboratory tests and measurements and their results from clinical narratives seems to be a challenging task in clinical information extraction. The great variety of tests and the fact that most results contain numerical values differentiate this task from most entity recognition and linking tasks. Participant systems have achieved good results but there is still room for improvement, especially as far as recall is concerned. As this was the first time that we were proposing this task, we decided to keep it strictly focused on relations between tests and their results, but in the future it might be interesting to integrate this task in a more complex information extraction effort that considers a wider range of clinical entities and relations.</p>
      <p>Acknowledgments: This work has been partially funded by the Basque Government postdoctoral grant POS 2022 2 0024.</p>
      <p>References</p>
      <p>[6] D. Newman-Griffis, G. Divita, B. Desmet, A. Zirikly, C. P. Rosé, E. Fosler-Lussier, Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets, Journal of the American Medical Informatics Association 28 (2020) 516–532. URL: https://doi.org/10.1093/jamia/ocaa269. doi:10.1093/jamia/ocaa269.</p>
      <p>[7] G. Alfattni, N. Peek, G. Nenadic, Extraction of temporal relations from clinical free text: A systematic review of current approaches, Journal of Biomedical Informatics 108 (2020) 103488. URL: https://www.sciencedirect.com/science/article/pii/S1532046420301167. doi:10.1016/j.jbi.2020.103488.</p>
      <p>[8] T. Hao, H. Liu, C. Weng, Valx: A System for Extracting and Structuring Numeric Lab Test Comparison Statements from Text, Methods of Information in Medicine 55 (2016) 266–75. doi:10.3414/ME15-01-0112.</p>
      <p>[9] B. Percha, Modern Clinical Text Mining: A Guide and Review, Annual Review of Biomedical Data Science 4 (2021) 165–187. URL: https://doi.org/10.1146/annurev-biodatasci-030421-030931. doi:10.1146/annurev-biodatasci-030421-030931, PMID: 34465177.</p>
      <p>[10] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[11] B. Magnini, B. Altuna, A. Lavelli, M. Speranza, R. Zanoli, The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases, in: Proceedings of the Seventh Italian Conference on Computational Linguistics, Associazione Italiana di Linguistica Computazionale, Bologna, Italy, 2020. URL: http://ceur-ws.org/Vol-2769/paper_55.pdf.</p>
      <p>[12] W. F. Styler, S. Bethard, S. Finan, M. Palmer, S. Pradhan, P. C. de Groen, B. Erickson, T. Miller, C. Lin, G. Savova, et al., Temporal Annotation in the Clinical Domain, Transactions of the Association for Computational Linguistics 2 (2014) 143–154. URL: http://aclweb.org/anthology/Q14-1012.</p>
      <p>[13] J. Pustejovsky, J. M. Castaño, R. Ingria, R. Saurí, R. J. Gaizauskas, A. Setzer, G. Katz, D. R. Radev, TimeML: Robust Specification of Event and Temporal Expressions in Text, New Directions in Question Answering 3 (2003) 28–34. URL: http://www.timeml.org/publications/timeMLpubs/IWCS-v4.pdf.</p>
      <p>[14] R. Zanoli, A. Lavelli, D. Verdi do Amarante, D. Toti, Assessment of the E3C corpus for the recognition of disorders in clinical texts, Natural Language Engineering (2023) 1–19. doi:10.1017/S1351324923000335.</p>
      <p>[15] L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (1945) 297–302. URL: http://www.jstor.org/pss/1932409.</p>
      <p>[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[17] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, CoRR abs/2302.13971 (2023). URL: https://doi.org/10.48550/arXiv.2302.13971. doi:10.48550/arXiv.2302.13971. arXiv:2302.13971.</p>
      <p>[18] S. Schweter, Italian BERT and ELECTRA models. Version 1.0.1, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</p>
      <p>[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019, 2019. URL: http://arxiv.org/abs/1910.01108.</p>
      <p>[20] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localising in-domain adaptation of transformer-based biomedical language models, ArXiv abs/2212.10422 (2022).</p>
      <p>[21] B. Altuna, R. Agerri, L. Salas-Espejo, J. J. Saiz, R. Zanoli, M. Speranza, B. Magnini, A. Lavelli, G. Karunakaran, Overview of TESTLINK at IberLEF 2023: Linking Results to Clinical Laboratory Tests and Measurements, Procesamiento del Lenguaje Natural 71 (2023).</p>
      <p>[22] G. Sarti, M. Nissim, IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.</p>
      <p>[23] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An Instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.</p>
      <p>[24] A. McCallum, W. Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, in: Proceedings</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>K.</given-names> <surname>Jain</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Prajapati</surname></string-name>,
          <article-title>NLP/Deep Learning Techniques in Healthcare for Decision Making</article-title>,
          <source>Primary Health Care</source>
          <volume>11</volume>
          (<year>2021</year>). URL: https://www.iomcworld.org/open-access/nlpdeep-learning-techniques-in-healthcare-for-decision-making-66608.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>O.</given-names> <surname>Sankoh</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Byass</surname></string-name>,
          <article-title>Cause-specific mortality at INDEPTH Health and Demographic Surveillance System Sites in Africa and Asia: concluding synthesis</article-title>,
          <source>Global Health Action</source>
          <volume>7</volume>
          (<year>2014</year>). doi:10.3402/gha.v7.25590.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>A. E.</given-names> <surname>Johnson</surname></string-name>,
          <string-name><given-names>T. J.</given-names> <surname>Pollard</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>L.-w. H.</given-names> <surname>Lehman</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ghassemi</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Moody</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Szolovits</surname></string-name>,
          <string-name><given-names>L. Anthony</given-names> <surname>Celi</surname></string-name>,
          <string-name><given-names>R. G.</given-names> <surname>Mark</surname></string-name>,
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>,
          <source>Scientific Data</source>
          <volume>3</volume>
          (<year>2016</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>O.</given-names> <surname>Trigueros</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Blanco</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Lebeña</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Casillas</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pérez</surname></string-name>,
          <article-title>Explainable ICD multi-label classification of EHRs in Spanish with convolutional attention</article-title>,
          <source>International Journal of Medical Informatics</source>
          <volume>157</volume>
          (<year>2022</year>)
          <fpage>104615</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1386505621002410. doi:10.1016/j.ijmedinf.2021.104615.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>S.</given-names> <surname>Santiso</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pérez</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Casillas</surname></string-name>,
          <article-title>Adverse Drug Reaction extraction: Tolerance to entity recognition errors and sub-domain variants</article-title>,
          <source>Computer Methods and Programs in Biomedicine</source>
          <volume>199</volume>
          (<year>2021</year>)
          <fpage>105891</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0169260720317247. doi:10.1016/j.cmpb.2020.105891.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name><given-names>F.</given-names> <surname>Tamburini</surname></string-name>,
          <article-title>How "BERTology" Changed the State-of-the-Art also for Italian NLP</article-title>,
          in:
          <string-name><given-names>J.</given-names> <surname>Monti</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>dell'Orletta</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Tamburini</surname></string-name>
          (Eds.),
          <source>Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021</source>,
          volume
          <volume>2769</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2769/paper_79.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name><given-names>O.</given-names> <surname>Bodenreider</surname></string-name>,
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>,
          <source>Nucleic Acids Res.</source>
          <volume>32</volume>
          (<year>2004</year>)
          <fpage>267</fpage>
          -
          <lpage>270</lpage>
          . URL: http://dblp.uni-trier.de/db/journals/nar/nar32.html#Bodenreider04.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [27]
          <string-name><given-names>T.</given-names> <surname>Brown</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>,
          <string-name><given-names>J. D.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Ziegler</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>,
          <article-title>Language models are few-shot learners</article-title>,
          in:
          <string-name><given-names>H.</given-names> <surname>Larochelle</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ranzato</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hadsell</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Balcan</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lin</surname></string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>,
          volume
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>