<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CLinkaRT at EVALITA 2023: Overview of the Task on Linking a Lab Result to its Test Event in the Clinical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Begoña Altuna</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Goutham Karunakaran</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Lavelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuela Speranza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Zanoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, 38123 Povo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
          ,
          <addr-line>Manuel Lardizabal 1, 20018 Donostia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>CLinkaRT at EVALITA 2023 is a relation extraction task based on clinical cases taken from the E3C corpus, i.e. Italian written documents reporting statements of a clinical practice. The task consists in identifying clinical results and measures and linking them to the laboratory tests and measurements from which they were obtained. Three teams participated in the task and various supervised machine learning models, both traditional and based on deep learning, were evaluated. In this evaluation, the deep learning models outperformed the traditional ones. Interestingly, none of the teams explored the use of few-shot language modeling. However, the fact that the supervised models significantly outperformed the task baselines implementing few-shot learning shows the crucial role still played by the availability of annotated training data.</p>
      </abstract>
      <kwd-group>
        <kwd>Relation Extraction</kwd>
        <kwd>Clinical NLP</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Supervised Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>
        tus at a certain time of the development of a disorder and
are crucial to choose the right diagnosis. From a more
There is a growing interest in processing clinical data for technical point of view, processing laboratory tests and
tasks of public interest, such as clinical decision making their results also brings up a new perspective on the
treat[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or monitoring of the health status of a country [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. ment of data, since it requires interpreting numeric values
While for this purpose large amounts of structured data and ranges and therefore can not be handled as a
comare needed, the reality is that most clinical data are stored mon named entity recognition task [9]. In this context,
as free unstructured clinical texts. Hence, the ability of the CLinkaRT task (LINKing A Result to its Test in the
extracting information directly from natural language texts CLINnical domain) in EVALITA 2023 [10] provides an
and to increase the volume of databases and structured opportunity to evaluate different Natural Language
Prodatasets, such as MIMIC-III [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], is crucial. cessing approaches and does this with a focus on Italian,
      </p>
      <p>
        Having these goals into account, scholars have devel- a less explored language than English.
oped a series of resources for information extraction from
clinical texts. Clinical information extraction efforts have
often given priority to the identification of diseases [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] 2. Task Description
or events [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As far as the extraction of relations from
clinical texts is concerned, previous work has focused The CLinkaRT task consists in identifying textual
menon concept normalization [6] and temporal relations [7], tions of both laboratory tests and measurements in a
clinamong others. Laboratory tests and measurements and ical narrative, and then linking these to their respective
their results have been given little attention [8], although results. Clinical narratives (or clinical cases) are
docthey provide interesting information on the patients’ sta- uments reporting statements of a clinical practice,
presenting the reason for a clinical visit, the description of
EVALITA 2023: 8th Evaluation Campaign of Natural Language Pro- physical exams, the assessment of the patient’s situation
*ceCsosrinregspaonnddSinpgeeacuhthTooro.ls for Italian, Sep 7 – 8, Parma, IT aLnadbotrhaetodriya gtensotssiasn,dasmweaeslluraesmtehnetsfoalrleowcoinmgmtroenaltymdeonntse.
†$Thbeesgeoanuat.halotrusncao@netrhibuu.etueds (eBq.uaAlllytu.na); as part of this process and are typically documented in
goutham.karunakaran@studenti.unitn.it (G. Karunakaran); clinical narratives.
lavelli@fbk.eu (A. Lavelli); magnini@fbk.eu (B. Magnini); Figure 1 presents an excerpt of a clinical case where
labmanspera@fbk.eu (M. Speranza); zanoli@fbk.eu (R. Zanoli) oratory tests have been marked in bold1 and their results
0000-0002-4027-2014 (B. Altuna)
      </p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License in italics.</p>
      <p>CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutiRon 4W.0Iontrekrnsathioonapl(CPCrBoYc4e.0e)d.ings (CEUR-WS.org) 1Note that the head of the mention is capitalized.</p>
      <sec id="sec-1-1">
        <title>Figure 1: Excerpt of a clinical case</title>
        <p>Osvaldo, anni 52, ha una storia di diarrea e calo ponderale che si può far riferire a due anni prima. Non c'è storia di sanguinamento gastroenterico ed una RICERCA di sangue occulto fecale è risultata negativa su tre campioni. Ammette di averci dato dentro con l'alcol in passato, ma da diversi anni è assolutamente astinente. Ha un diabete, controllato con insulina. Sei anni prima è stato colecistectomizzato. Gli ESAMI di laboratorio sono normali, se si fa eccezione per una lieve anemia, così come normali sono lo STUDIO radiologico del piccolo e del grosso intestino.</p>
        <sec id="sec-1-1-1">
          <title>Pertains-To relations in the example</title>
          <p>In this example, we have the following Pertains-To relations that participants needed to identify between results and tests:
• negative / negativa -&gt; the research for fecal blood / una RICERCA di sangue occulto fecale
• normal / normali -&gt; the laboratory exams / gli ESAMI di laboratorio
• normal / normali -&gt; the radiological study of the bowels / lo STUDIO radiologico del piccolo e del grosso intestino</p>
          <p>3. Dataset</p>
          <p>The CLinkaRT task is based on the Italian part of E3C, the multilingual European Clinical Case Corpus [11], a collection of clinical cases derived from different sources, such as published articles available from PubMed, and existing corpora. As such, the dataset encompasses a variety of clinical disciplines in different hospitals and a wide range of laboratory tests.</p>
          <p>One of the three sections which make up the E3C corpus has been manually annotated with different types of information, such as:
• events (which include laboratory tests, among others), temporal expressions and temporal relations, annotated according to THYME [12], an adaptation of the TimeML framework [13];
• results of laboratory tests and measurements, marked through the RML tag (defined within the E3C project), and Pertains-To relations holding between an RML and the event it refers to;
• clinical entities (in particular diseases, syndromes, findings, signs, symptoms, etc.) listed in medical taxonomies, which is useful for tasks focusing on clinical entity recognition and analysis [14].
More specifically, the CLinkaRT task is based on two sets of data, a training set and a test set.</p>
          <p>3.1. Annotation</p>
          <p>Among all the annotations foreseen by the E3C project, the data used for CLinkaRT contain the following annotations:
• Laboratory test and measurement EVENTS: they include medical procedures in which parts of the body or bodily substances (blood, urine, etc.) are analyzed, as well as different acts of measuring, such as measuring patients' physical features (e.g. height and weight) or the size of a lesion or mass;
• RMLs: the results of lab tests and measurements; they can consist of a text string (e.g. normal / normali) but more often contain numerical values, typically followed by a unit of measure (e.g. 7,5 g/dl);
• Pertains-To relations connecting an RML (the source) to the relevant EVENT (the target). Pertains-To relations can be one-to-one, one-to-many and many-to-one.</p>
          <p>Both RMLs and test and measurement EVENTS are marked as strings of text; notice, however, that tests and measurements belong to the TimeML category EVENT and are therefore marked by their syntactic head only (i.e. strictly one token only), while RMLs, as defined within the E3C project, are marked by a whole syntactic chunk (one or more tokens).</p>
          <p>In the example below we have two Pertains-To relations between two EVENTS, i.e. a laboratory test (protidemia totale) and a measurement (peso), and their results (RMLs), i.e. 4,5 g/dl and 19 Kg respectively.
Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl / Body weight of 19 Kg, total protidemia 4,5 g/dl
19 Kg -&gt; Peso
4,5 g/dl -&gt; protidemia</p>
          <p>3.2. Inter Annotator Agreement</p>
          <p>All the data used for the task have been manually annotated by expert computational linguists, and inter-annotator agreement has been assessed on ten documents, which have been annotated by two annotators independently. On average, each annotator has identified 111 relations.</p>
          <p>The resulting Dice's coefficient [15] is 0.87, which is quite high given that agreement between annotators is only considered as such when there is a complete overlap in the spans of the source and the target (exact match). The high agreement between annotators ensures that annotations throughout the whole dataset are consistent. More specifically, the inter-annotator agreement is particularly high when numerical values are present in the RMLs (it reaches 0.92 in terms of Dice's coefficient), while it is slightly lower (Dice=0.84) in the case of RMLs without numerical values.</p>
          <p>3.3. Data Distribution Format</p>
          <p>The annotated data have been provided to the participants in a format that is an adaptation of the PubTator format (see an example in Figure 2). It consists of a straightforward tab-delimited text file, where every document in the dataset is in a new line preceded by the DOCID and the |t| marker. A blank line is used as an indicator of the end of the document, followed by the annotated relations: every relation is in a separate line and is represented as an ordered pair, as in (RML -&gt; EVENT), and each string is represented by its start and end character offsets.</p>
          <p>Figure 2 (document line only): 100001|t|Osvaldo, anni 52, ha una storia di diarrea e calo ponderale che si può far riferire a due anni prima. Non c'è storia di sanguinamento gastroenterico ed una RICERCA di sangue occulto fecale è risultata negativa su tre campioni. Ammette di averci dato dentro con l'alcol in passato, ma da diversi anni è assolutamente astinente. Ha un diabete, controllato con insulina. Sei anni prima è stato colecistectomizzato. Gli ESAMI di laboratorio sono normali, se si fa eccezione per una lieve anemia, così come normali sono lo STUDIO radiologico del piccolo e del grosso intestino.</p>
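          <p>As a rough illustration of the distribution format just described, the adapted PubTator files can be read with a few lines of Python. The exact column layout of the relation lines (tab-separated document id and character offsets, and the offset values used in the sample) is our assumption from the prose above, not the official reader.

```python
# Sketch of a reader for the adapted PubTator format described in Section 3.3.
# Document lines look like "DOCID|t|text"; relation lines are assumed to be
# tab-separated: DOCID, RML start, RML end, EVENT start, EVENT end.

def parse_clinkart(lines):
    docs, relations = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if "|t|" in line:                       # document line: DOCID|t|text
            doc_id, text = line.split("|t|", 1)
            docs[doc_id] = text
        elif line.strip():                      # relation line (assumed layout)
            doc_id, rml_start, rml_end, ev_start, ev_end = line.split("\t")[:5]
            text = docs[doc_id]
            relations.append((doc_id,
                              text[int(rml_start):int(rml_end)],   # RML string
                              text[int(ev_start):int(ev_end)]))    # EVENT string
    return docs, relations

sample = [
    "100001|t|Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl",
    "",                         # blank line ends the document text
    "100001\t17\t22\t0\t4",     # 19 Kg -&gt; Peso
    "100001\t42\t50\t24\t34",   # 4,5 g/dl -&gt; protidemia
]
docs, rels = parse_clinkart(sample)
```

Resolving offsets back to surface strings, as above, is also how the ordered pairs (RML -&gt; EVENT) of Figure 2 can be checked against the document text.</p>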
        </sec>
      </sec>
    </sec>
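    <p>The exact-match agreement measure used in Section 3.2 is straightforward to reproduce. A minimal sketch, assuming relations are represented as hashable tuples of document id and character spans (the representation and offsets are illustrative):

```python
# Dice's coefficient between two annotators' relation sets: a relation is
# shared only when both the RML span and the EVENT span match exactly.

def dice(set_a, set_b):
    total = len(set_a) + len(set_b)
    if total == 0:
        return 1.0
    return 2 * len(set_a.intersection(set_b)) / total

# Relations as (doc_id, RML_span, EVENT_span); offsets are illustrative.
ann1 = {("d1", (17, 22), (0, 4)), ("d1", (42, 50), (24, 34))}
ann2 = {("d1", (17, 22), (0, 4)), ("d1", (41, 50), (24, 34))}  # RML span differs

print(dice(ann1, ann2))  # 0.5: only the exact match counts as agreement
```

Because only complete span overlaps count, a single boundary disagreement removes the relation from the intersection entirely, which is why the reported coefficient is lower for non-numerical RMLs with fuzzier boundaries.</p>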
    <sec id="sec-2">
      <title>4. Baselines</title>
      <sec id="sec-2-1">
        <title>Baseline Approaches</title>
        <p>To improve the assessment of participant systems' performance, supervised and unsupervised baselines have been used for comparative analysis. These baselines have been made available through the GitLab repository.2</p>
        <p>The supervised baseline was assessed using two different approaches.</p>
        <p>The first approach is based on vocabulary transfer from training to testing (voc. tran.). In this approach, a system is used to recognize textual references to laboratory tests and measurements present in the test set using the entities found in the training set. Additionally, regular expressions derived from the training data are used to recognize various result entities that pertain to measurements, typically represented by values. To establish relationships between the recognized entities, a relation is created for each pair of laboratory test/measurement and result entities that co-occur within the same sentence.</p>
        <p>The second approach relies on a fine-tuned multilingual BERT model3 trained on textual mentions involved in relations within the training data. The implementation of this model has been carried out using the SimpleTransformer library.4 The model is capable of recognizing both textual references to laboratory tests and measurements and their results.
Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl / Body weight of 19 Kg, total protidemia 4,5 g/dl
19 Kg -&gt; Peso
4,5 g/dl -&gt; protidemia
In the example above the implemented model identifies the following mentions, using the IOB annotation, where test events are represented as TST and results as RML:
Peso [B-TST] corporeo di 19 [B-RML] Kg [I-RML], protidemia [B-TST] totale 4,5 [B-RML] g/dl [I-RML]</p>
        <p>Subsequently, an additional multilingual BERT model (configured similarly to the previous BERT model) was fine-tuned on the annotated relations within the training data to extract the relationships between the recognized laboratory tests and their results in the test data. Concerning the training data, both positive and negative examples were generated for sentences containing at least one laboratory test/measurement and one result entity. For each generated example, the entities in the relationship were marked by adding "[TST]" as both the prefix and suffix to the laboratory tests and measurements, while "[RML]" was used to denote the results. The number of examples generated per sentence was determined by multiplying the number of laboratory tests by the number of result entities present in the sentence.</p>
        <p>For the test data, the examples to be classified were generated following a similar process, with the difference that instead of using the entities from the gold standard we used the predicted entities. In the case of the sentence reported above, the following examples were generated, along with their corresponding model predictions (1=positive, 0=negative):
1 [TST]Peso[TST] corporeo di [RML]19 Kg[RML], protidemia totale 4,5 g/dl
0 [TST]Peso[TST] corporeo di 19 Kg, protidemia totale [RML]4,5 g/dl[RML]
0 Peso corporeo di [RML]19 Kg[RML], [TST]protidemia[TST] totale 4,5 g/dl
1 Peso corporeo di 19 Kg, [TST]protidemia[TST] totale [RML]4,5 g/dl[RML]</p>
        <p>2 https://gitlab.fbk.eu/zanoli/clinkart-baseline.git</p>
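        <p>The candidate-generation scheme described above can be sketched as follows. The marker-insertion helper is our reconstruction of the described [TST]/[RML] wrapping under assumed character-span inputs, not the released baseline code.

```python
# For every (test, result) pair in a sentence, build one classification
# example by wrapping the two mentions in [TST] and [RML] markers, as in
# the supervised baseline description: tests x results examples per sentence.

def make_candidates(sentence, tests, results):
    """tests / results: lists of (start, end) character spans."""
    examples = []
    for t_start, t_end in tests:
        for r_start, r_end in results:
            marked = sentence
            # Insert markers right-to-left so earlier offsets remain valid.
            for start, end, tag in sorted(
                    [(t_start, t_end, "[TST]"), (r_start, r_end, "[RML]")],
                    reverse=True):
                marked = marked[:start] + tag + marked[start:end] + tag + marked[end:]
            examples.append(marked)
    return examples

sent = "Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl"
tests = [(0, 4), (24, 34)]        # Peso, protidemia
results = [(17, 22), (42, 50)]    # 19 Kg, 4,5 g/dl

candidates = make_candidates(sent, tests, results)
print(candidates[0])  # [TST]Peso[TST] corporeo di [RML]19 Kg[RML], protidemia totale 4,5 g/dl
```

With two tests and two results the sketch yields four candidates, matching the tests-times-results count given above; the relation classifier then labels each candidate 1 or 0.</p>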
        <p>The unsupervised baseline uses GPT and OpenAI's API (text-davinci-003). It focuses on one-shot learning, where the model receives a single example during inference through the prompt. This makes one-shot learning more similar to unsupervised learning than to supervised learning. The prompt used for performing this evaluation is: Ho un compito che è quello di estrarre menzioni di test di laboratorio e dei loro risultati da casi clinici. Ecco un esempio di testo e output: docId:100998. Nota: nell'output viene scritto prima il risultato e poi il nome del test. Sono separati da "|". Ora dammi l'output per il seguente testo.5 Within the prompt, docId:100998 represents the annotated document selected from the training dataset as the only example for GPT.</p>
        <p>5 My task is to extract laboratory test mentions and their results from clinical cases. Here you have an example of a text and its output: docId:100998. Note: in the output the result is written first and then the name of the test. They are separated by "|". Now give me the output for the following text.</p>
        <p>5. System Descriptions</p>
        <p>Eight teams expressed their interest in participating in the task. Eventually, four teams submitted their annotated data, resulting in a total of six runs. After the evaluation phase, one team decided to withdraw, so we now present the results of four runs submitted by three different teams. Participants explored various (supervised) approaches, including traditional machine learning methods, as well as BERT [16] and its derivative models, and top Large Language Models (LLMs) such as LLaMA [17]. A brief overview of each team's approach is reported below, while the corresponding results are reported in Table 1.</p>
        <p>Simple Ideas: Unlike conventional methods that extract entities and relations separately, the proposed approach uses a pipeline in which EVENTS are identified first and the Pertains-To relations are then created from those. Several BERT-based models were assessed, including Italian BERT [18] and DistilBERT [19], which were pre-trained on general topics. Additionally, BioBIT and MedBIT-R3-plus [20] were evaluated as they were specifically pre-trained for the medical field. Among these models, MedBIT-R3-plus resulted as the best model. To optimize their performance, the models were fine-tuned on an augmented version of the original dataset. This augmentation involved the addition of new sentences derived from the original ones, wherein random words were substituted with similar words in the embedding space. This approach achieved the best results in the task and it also obtained the highest ranking in the parallel TESTLINK task at IberLEF 2023 [21]. The availability of the implemented code contributes to the reproducibility of the presented results.</p>
        <p>ExtremITA: The team employed a unified neural model to address all the EVALITA 2023 tasks. To achieve this, they experimented with two different approaches. One approach involved fine-tuning an encoder-decoder model, specifically T5 [22] pre-trained on Italian texts. The second approach is an instruction-tuned decoder-only model based on the LLaMA [17] foundational models. This model was initially trained on Italian translations of Alpaca [23] instruction data. In both cases, the models were fine-tuned by using the complete set of datasets provided by the EVALITA 2023 tasks. Moreover, the CLinkaRT dataset was expanded with annotated documents derived from the Spanish dataset made available in the TESTLINK task. The model built upon the LLaMA model showed strong performance across multiple tasks at EVALITA 2023, including the CLinkaRT task, where it ranked second. The implemented code has been made available.</p>
        <p>
          Polimi: The team used a traditional pipeline-based approach for relation extraction. The first module focused on recognizing entities related to laboratory tests and their corresponding measurements. The module was implemented using two diverse models: CRF [24] and UmBERTo [
          <xref ref-type="bibr" rid="ref6">25</xref>
          ]. For training the CRF, a range of lexical features were used, along with external sources of knowledge like UMLS [
          <xref ref-type="bibr" rid="ref7">26</xref>
          ]. Subsequently, the second module aimed at establishing relationships between exams and results by pairing them based on proximity within the same sentence. While the CRF method obtained quite satisfactory results, tokenization issues prevented any results from being obtained using UmBERTo.
        </p>
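        <p>Returning to the unsupervised baseline, the one-shot prompt can be assembled from the instruction and a single training example. The helper names and the example serialisation (one result|test pair per line) are assumptions based on the prompt description, not the released baseline; text-davinci-003 has since been retired by OpenAI.

```python
# Sketch of one-shot prompt assembly for the GPT baseline. The serialisation
# of the in-context example (result first, then test, separated by "|") is
# assumed from the prompt's own instructions.

INSTRUCTION = (
    "Ho un compito che è quello di estrarre menzioni di test di laboratorio "
    "e dei loro risultati da casi clinici. Ecco un esempio di testo e output:"
)

def build_prompt(example_text, example_relations, target_text):
    # One line per relation: result first, then the test, separated by "|".
    example_output = "\n".join(f"{rml}|{event}" for rml, event in example_relations)
    return (
        f"{INSTRUCTION}\n{example_text}\n{example_output}\n"
        "Nota: nell'output viene scritto prima il risultato e poi il nome del "
        "test. Sono separati da \"|\". Ora dammi l'output per il seguente testo.\n"
        f"{target_text}"
    )

prompt = build_prompt(
    "Peso corporeo di 19 Kg, protidemia totale 4,5 g/dl",
    [("19 Kg", "Peso"), ("4,5 g/dl", "protidemia")],
    "Gli esami di laboratorio sono normali.",
)
```

The single in-context example is what makes this a one-shot rather than a zero-shot setting: no model weights are updated, only the prompt carries task information.</p>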
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Results</title>
      <sec id="sec-3-1">
        <title>Results of the submitted runs (Table 1)</title>
        <p>Table 1, reporting the scores of the runs Simple Ideas-BERT, ExtremITA-LLaMA, Polimi-CRF and ExtremITA-T5, was lost in extraction; only the fragment "ExtremITA-T5 n-ary 37.50 30.77" survives.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
      <sec id="sec-4-1">
        <title>Per-team results (Tables 3-5)</title>
        <p>The per-team breakdowns (Simple Ideas-BERT, ExtremITA-LLaMA, Polimi-CRF, ExtremITA-T5) in Tables 3, 4 and 5 were lost in extraction.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Discussion of the results</title>
        <sec id="sec-4-5-1">
          <title>Overall observations</title>
          <p>Both traditional machine learning and more recent deep learning models were tested for relation extraction. It is worth noting that all participating systems were based on supervised approaches. Additionally, every system outperformed the vocabulary transfer baseline, which represents the threshold below which systems are not expected to perform.</p>
          <p>Table 2: Precision (Pr), Recall (Re) and F1 measures obtained by the supervised (S) and unsupervised (U) baselines.
Baseline    Type  Pr     Re     F1
mBERT       S     61.37  64.37  62.83
GPT         U     29.55  48.73  36.79
voc. tran.  S     29.95  31.86  30.88</p>
          <p>
            Surprisingly, none of the teams attempted to evaluate few-shot learning with LLMs such as GPT [
            <xref ref-type="bibr" rid="ref8">27</xref>
            ] or LLaMA [17]. ExtremITA did evaluate LLaMA, but instead of employing few-shot learning, they opted for a fine-tuning approach, refining the model using the available training data.
          </p>
          <p>We additionally evaluated systems' performance (in terms of F1 measure) along two different dimensions. Table 3 shows the results distinguishing two categories of relations, i.e. n-ary relations (one-to-many and many-to-one) and one-to-one relations. Table 4 presents separate results for relations involving numerical RMLs and non-numerical RMLs. Finally, Table 5 reports the accuracy of participant systems in the recognition of RMLs and EVENTs, i.e. the sources and targets of the relations.</p>
          <p>The assessment of the GPT-based baseline highlights the present understanding that few-shot learning cannot be considered a viable alternative to fine-tuning in the context of the present task. Fine-tuning, although requiring annotated data, produces significantly better results.</p>
          <p>Despite using different pre-trained models trained on diverse domain-specific data (generic domain vs medical domain), the top-performing team (Simple Ideas), along with the second-placed team (ExtremITA) and the baseline model based on multilingual BERT (mBERT), achieved remarkably similar results.</p>
          <p>6 https://biocreative.bioinformatics.udel.edu/tasks/biocreative-v/track-3-cdr/</p>
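          <p>The baseline scores in Table 2 rest on the exact-match criterion described in Section 3.2. A minimal sketch of such a scorer, with the relation representation and offsets chosen for illustration:

```python
# Exact-match scoring of Pertains-To relations: a prediction is a true
# positive only if both its RML span and its EVENT span coincide with gold.

def precision_recall_f1(gold, predicted):
    tp = len(gold.intersection(predicted))
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Relations as (doc_id, RML_span, EVENT_span); offsets are illustrative.
gold = {("d1", (17, 22), (0, 4)), ("d1", (42, 50), (24, 34))}
pred = {("d1", (17, 22), (0, 4)), ("d1", (42, 50), (35, 41))}  # wrong EVENT span

p, r, f1 = precision_recall_f1(gold, pred)
print(p, r, f1)  # 0.5 0.5 0.5
```

Under this criterion a high-precision, low-recall system such as the CRF run is penalised heavily in F1, which is the pattern discussed below.</p>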
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusions</title>
      <p>CRF (Polimi), as the only traditional machine learning algorithm involved in the task, obtained a precision (70.34) in line with that of the top models (71.10). Nevertheless, its relatively lower recall (27.12), in comparison to the recall of the best-performing models (60.62), results in moderately satisfactory outcomes in terms of F1 score (39.15).</p>
      <p>One team (Simple Ideas) conducted an evaluation of their pipeline-based approach in two distinct tasks: the Italian CLinkaRT task (F1 62.99) and the parallel TESTLINK task at IberLEF 2023, focusing on Basque (F1 72.65) and Spanish (F1 68.38). Interestingly, this approach demonstrated superior performance across all three languages.</p>
      <p>Based on the outcome of our analysis of systems' performance in relation to two distinct dimensions, i.e. n-ary and one-to-one relations on the one hand, and numerical and non-numerical RMLs on the other (see Tables 3 and 4), we can observe that extracting n-ary relations is more challenging than extracting one-to-one relations, which is not surprising. Moreover, the task of extracting relations involving numerical RMLs seems easier than extracting relations involving non-numerical entities, which may be correlated to the lower agreement obtained on the latter in the IAA test.</p>
      <p>An analysis of the entities involved in the relations extracted by the participants' systems shows that recognising EVENTs seems to be generally harder than recognising RMLs (Table 5). One possible explanation for this is that EVENTs are commonly identified by their syntactic head (leaving out the other elements in the phrase), which can sometimes be quite challenging.</p>
      <p>Participants report two key reasons for the incorrect tagging produced by their models. On the one hand, BERT tokenizers struggle to split medical terms correctly (e.g. antitrombina -&gt; anti trombina), which leads to wrongly setting the boundaries of the annotations. In addition, the difficulty of capturing the most peripheral elements in the entity mentions has also been a cause of failures to detect the entity spans correctly. This is the case of "punte di [circa 1200 pg/ml]" or "pari a 0 o [inferiori a 1.5 mg/dl]", in which only the tokens between the brackets have been annotated by the systems.</p>
      <p>The results obtained did not allow us to determine whether the task being examined is inherently more difficult in one language compared to other languages due to language-specific traits. Within this framework, the vocabulary transfer baseline, which is expected to provide a preliminary indication of the task's difficulty, achieves better results on the Italian CLinkaRT task (F1 30.88) compared to the parallel TESTLINK task for Basque (F1 23.96) and Spanish (F1 22.10). However, the participating systems, such as the Simple Ideas system, showed contrasting results.</p>
      <p>Extracting laboratory tests and measurements and their results from clinical narratives seems to be a challenging task in clinical information extraction. The great variety of tests and the fact that most results contain numerical values differentiate this task from most entity recognition and linking tasks. Participant systems have achieved good results but there is still room for improvement, especially as far as recall is concerned. As this was the first time that we were proposing this task, we decided to keep it strictly focused on relations between tests and their results, but in the future it might be interesting to integrate this task in a more complex information extraction effort that considers a wider range of clinical entities and relations.</p>
      <p>Acknowledgments: This work has been partially funded by the Basque Government postdoctoral grant POS 2022 2 0024.</p>
      <p>References</p>
      <p>[6] D. Newman-Griffis, G. Divita, B. Desmet, A. Zirikly, C. P. Rosé, E. Fosler-Lussier, Ambiguity in medical concept normalization: An analysis of types and coverage in electronic health record datasets, Journal of the American Medical Informatics Association 28 (2020) 516–532. URL: https://doi.org/10.1093/jamia/ocaa269. doi:10.1093/jamia/ocaa269.</p>
      <p>[7] G. Alfattni, N. Peek, G. Nenadic, Extraction of temporal relations from clinical free text: A systematic review of current approaches, Journal of Biomedical Informatics 108 (2020) 103488. URL: https://www.sciencedirect.com/science/article/pii/S1532046420301167. doi:10.1016/j.jbi.2020.103488.</p>
      <p>[8] T. Hao, H. Liu, C. Weng, Valx: A System for Extracting and Structuring Numeric Lab Test Comparison Statements from Text, Methods of Information in Medicine 55 (2016) 266–75. doi:10.3414/ME15-01-0112.</p>
      <p>[9] B. Percha, Modern Clinical Text Mining: A Guide and Review, Annual Review of Biomedical Data Science 4 (2021) 165–187. URL: https://doi.org/10.1146/annurev-biodatasci-030421-030931. doi:10.1146/annurev-biodatasci-030421-030931, PMID: 34465177.</p>
      <p>[10] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[11] B. Magnini, B. Altuna, A. Lavelli, M. Speranza, R. Zanoli, The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases, in: Proceedings of the Seventh Italian Conference on Computational Linguistics, Associazione Italiana di Linguistica Computazionale, Bologna, Italy, 2020. URL: http://ceur-ws.org/Vol-2769/paper_55.pdf.</p>
      <p>[12] W. F. Styler, S. Bethard, S. Finan, M. Palmer, S. Pradhan, P. C. de Groen, B. Erickson, T. Miller, C. Lin, G. Savova, et al., Temporal Annotation in the Clinical Domain, Transactions of the Association for Computational Linguistics 2 (2014) 143–154. URL: http://aclweb.org/anthology/Q14-1012.</p>
      <p>[13] J. Pustejovsky, J. M. Castaño, R. Ingria, R. Saurí, R. J. Gaizauskas, A. Setzer, G. Katz, D. R. Radev, TimeML: Robust Specification of Event and Temporal Expressions in Text, New Directions in Question Answering 3 (2003) 28–34. URL: http://www.timeml.org/publications/timeMLpubs/IWCS-v4.pdf.</p>
      <p>[14] R. Zanoli, A. Lavelli, D. Verdi do Amarante, D. Toti, Assessment of the E3C corpus for the recognition of disorders in clinical texts, Natural Language Engineering (2023) 1–19. doi:10.1017/S1351324923000335.</p>
      <p>[15] L. R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (1945) 297–302. URL: http://www.jstor.org/pss/1932409.</p>
      <p>[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[17] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, CoRR abs/2302.13971 (2023). URL: https://doi.org/10.48550/arXiv.2302.13971. doi:10.48550/arXiv.2302.13971. arXiv:2302.13971.</p>
      <p>[18] S. Schweter, Italian BERT and ELECTRA models. Version 1.0.1, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</p>
      <p>[19] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, in: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing @ NeurIPS 2019, 2019. URL: http://arxiv.org/abs/1910.01108.</p>
      <p>[20] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localising in-domain adaptation of transformer-based biomedical language models, ArXiv abs/2212.10422 (2022).</p>
      <p>[21] B. Altuna, R. Agerri, L. Salas-Espejo, J. J. Saiz, R. Zanoli, M. Speranza, B. Magnini, A. Lavelli, G. Karunakaran, Overview of TESTLINK at IberLEF 2023: Linking Results to Clinical Laboratory Tests and Measurements, Procesamiento del Lenguaje Natural 71 (2023).</p>
      <p>[22] G. Sarti, M. Nissim, IT5: Large-scale Text-to-text Pretraining for Italian Language Understanding and Generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.</p>
      <p>[23] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An Instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.</p>
      <p>[24] A. McCallum, W. Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, in: Proceedings</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>K.</given-names> <surname>Jain</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Prajapati</surname></string-name>,
          <article-title>NLP/Deep Learning Techniques in Healthcare for Decision Making</article-title>,
          <source>Primary Health Care</source>
          <volume>11</volume>
          (<year>2021</year>). URL: https://www.iomcworld.org/open-access/nlpdeep-learning-techniques-in-healthcare-for-decision-making-66608.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>O.</given-names> <surname>Sankoh</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Byass</surname></string-name>,
          <article-title>Cause-specific mortality at INDEPTH Health and Demographic Surveillance System Sites in Africa and Asia: concluding synthesis</article-title>,
          <source>Global Health Action</source>
          <volume>7</volume>
          (<year>2014</year>). doi:10.3402/gha.v7.25590.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>A. E.</given-names> <surname>Johnson</surname></string-name>,
          <string-name><given-names>T. J.</given-names> <surname>Pollard</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>L.-w. H.</given-names> <surname>Lehman</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ghassemi</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Moody</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Szolovits</surname></string-name>,
          <string-name><given-names>L. Anthony</given-names> <surname>Celi</surname></string-name>,
          <string-name><given-names>R. G.</given-names> <surname>Mark</surname></string-name>,
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>,
          <source>Scientific Data</source>
          <volume>3</volume>
          (<year>2016</year>).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>O.</given-names> <surname>Trigueros</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Blanco</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Lebeña</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Casillas</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pérez</surname></string-name>,
          <article-title>Explainable ICD multi-label classification of EHRs in Spanish with convolutional attention</article-title>,
          <source>International Journal of Medical Informatics</source>
          <volume>157</volume>
          (<year>2022</year>)
          <fpage>104615</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1386505621002410. doi:10.1016/j.ijmedinf.2021.104615.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>S.</given-names> <surname>Santiso</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Pérez</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Casillas</surname></string-name>,
          <article-title>Adverse Drug Reaction extraction: Tolerance to entity recognition errors and sub-domain variants</article-title>,
          <source>Computer Methods and Programs in Biomedicine</source>
          <volume>199</volume>
          (<year>2021</year>)
          <fpage>105891</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0169260720317247. doi:10.1016/j.cmpb.2020.105891.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [25]
          <string-name><given-names>F.</given-names> <surname>Tamburini</surname></string-name>,
          <article-title>How "BERTology" Changed the State-of-the-Art also for Italian NLP</article-title>,
          in:
          <string-name><given-names>J.</given-names> <surname>Monti</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>dell'Orletta</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Tamburini</surname></string-name>
          (Eds.),
          <source>Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021</source>,
          volume
          <volume>2769</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2769/paper_79.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [26]
          <string-name><given-names>O.</given-names> <surname>Bodenreider</surname></string-name>,
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>,
          <source>Nucleic Acids Res.</source>
          <volume>32</volume>
          (<year>2004</year>)
          <fpage>267</fpage>
          -
          <lpage>270</lpage>
          . URL: http://dblp.uni-trier.de/db/journals/nar/nar32.html#Bodenreider04.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [27]
          <string-name><given-names>T.</given-names> <surname>Brown</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>,
          <string-name><given-names>J. D.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Ziegler</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>,
          <article-title>Language models are few-shot learners</article-title>,
          in:
          <string-name><given-names>H.</given-names> <surname>Larochelle</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Ranzato</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hadsell</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Balcan</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lin</surname></string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>,
          volume
          <volume>33</volume>
          , Curran Associates, Inc.,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>