<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>EVALITA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Simple Ideas at CLinkaRT: LeaNER and MeaNER Relation Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marius Micluța-Câmpeanu</string-name>
          <email>marius.micluta-campeanu@unibuc.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liviu P. Dinu</string-name>
          <email>liviu.p.dinu@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Computer Science, University of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Human Language Technologies Research Center, University of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Processing and Speech Tools for Italian</institution>
          ,
          <addr-line>Sep 7 - 8, Parma, IT</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>two consecutive Named Entity Recognition (NER) mod-</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>8</volume>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In this paper, we present our approach for performing relation extraction on clinical texts in the context of the CLinkaRT task at EVALITA 2023. Our system ranked first in this task with an F1-score of 62.99, outperforming most other submissions by a significant margin: an increase of 6.5% over the second best score of 59.16, while also improving over the mBERT baseline of 62.83. We pursue a simple yet unexplored method to determine sentence-level relations in text by relying on Named Entity Recognition models to identify the components of a relation. We apply this method to link laboratory results to their corresponding events in medical reports.</p>
      </abstract>
      <kwd-group>
        <kwd>EVALITA</kwd>
        <kwd>CLinkaRT</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>relation extraction</kwd>
        <kwd>transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>The availability of vast quantities of textual data in the</title>
        <p>
          biomedical domain from digital repositories like PubMed
Central has led to the development of highly specialized
resources and language models [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Nonetheless, most of these efforts have been focused on English, while other less-resourced languages were largely neglected due to the lack of available datasets.
        </p>
        <p>
          The typical approach for downstream tasks in these
languages is to resort to multilingual models, such as
mBERT [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The rising need for pretrained models in
languages other than English for biomedical applications
materialized in the past few years with the advent of
BioBIT/MedBIT for Italian [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and similar models for
other lower-resource languages: Spanish [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], Turkish [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
and French [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>
          In the context of creating better systems for Italian, the CLinkaRT shared task [<xref ref-type="bibr" rid="ref7">7</xref>] at EVALITA 2023 [<xref ref-type="bibr" rid="ref8">8</xref>] challenges participants to detect laboratory measurements and tests in clinical records in order to associate them with their corresponding results. The relevance of developing and improving relation extraction is highlighted in the literature, since it provides the core elements for building advanced biomedical text mining systems. Examples include discovering interactions between drugs, adverse effects, genes, chemicals and diseases; predicting inappropriate emergency room visits; generating educational documents; and building interaction networks.
        </p>
        <p>
          Having developed a similar system for our submission in the TESTLINK twin task [<xref ref-type="bibr" rid="ref12">12</xref>], we provide a shortened description of the implementation here and focus on the experiments and findings specific to CLinkaRT.
        </p>
      </sec>
      <sec id="sec-impl">
        <title>2.2. Implementation details</title>
        <p>
          Our approach relies on two consecutive Named Entity Recognition (NER) models. The first NER model is trained to predict all target entities in a sentence. For instance, the phrase “La creatinina oscillava tra 1,5–2 mg/dL con proteinuria sempre &lt; 1 g die” contains two relations: the target “creatinina” with “1,5–2 mg/dL” as its source, and “proteinuria” with “&lt; 1 g die” as its source. We begin by locating targets first because the annotations mark only the syntactic head of a target, e.g. esami “tests” is an appropriate target for both esami colturali “culture tests” and esami ematici “blood tests”.
        </p>
        <p>
          We encode the annotations for both NER models with standard IOB2 tags (inside, outside, beginning) for either sources (RML entities) or targets (EVENT entities). A regular relation extraction pipeline would first employ a NER model to determine sources and targets at the same time and then apply a relation classifier to all possible source-target pairs.
        </p>
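        <p>To make the target-side encoding concrete, the following sketch shows the IOB2 tags the first model would see for the example sentence (an illustration only: whitespace tokenization is a simplification, and EVENT/RML are the entity type names used above):</p>
        <preformat>
# IOB2 tags for the first (target) NER model: only the syntactic head
# of each target (EVENT) is annotated; everything else is outside.
tokens = ["La", "creatinina", "oscillava", "tra", "1,5-2", "mg/dL",
          "con", "proteinuria", "sempre", "&lt;", "1", "g", "die"]
tags   = ["O",  "B-EVENT",    "O",         "O",   "O",     "O",
          "O",  "B-EVENT",    "O",         "O",   "O",     "O",  "O"]
        </preformat>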
        <p>
          After determining all targets in a sentence, we transform the training examples to incorporate target locations directly in the text by adding a special marker token [T] before each target token, which should help the second NER model find relevant source entities. This is a viable strategy to denote one-to-one, one-to-many and many-to-one relations between sources and targets, thus effectively eliminating the need for a relation classifier model. The target annotation indicates just the syntactic head, so we do not add an end marker, because it might hinder the second NER model's ability to properly learn adequate target representations.
        </p>
        <p>
          With our approach, the first NER model is tasked to predict just target entities, while the second NER model is trained solely with labels for source entities. The consequence is that our models have a lower number of possible labels, determined by fewer IOB2 tags, therefore improving prediction performance. Each model has 3 tags: beginning, inside and outside. While the NER model used to predict targets only denotes the target head, we still need “inside” tags due to the sub-word splitting required by transformer models. Contrast this with a traditional pipeline that would have a NER model with 5 tags (2 beginning tags, 2 inside tags, 1 outside tag) followed by a relation classifier model.
        </p>
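        <p>The resulting label inventories, contrasted with those of a classic joint pipeline, can be summarized as follows (a sketch; the concrete label strings are an assumption):</p>
        <preformat>
TGT_LABELS = ["O", "B-EVENT", "I-EVENT"]   # first model: targets only
SRC_LABELS = ["O", "B-RML", "I-RML"]       # second model: sources only
# A traditional joint NER needs all five tags at once, followed by a
# separate relation classifier:
JOINT_LABELS = ["O", "B-EVENT", "I-EVENT", "B-RML", "I-RML"]
        </preformat>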
        <p>
          All relation types (one-to-one, one-to-many, many-to-one) are handled in a uniform manner. For every target in a sentence, we generate one sample with a single target marker [T]. In this regard, there is no difference between a source with multiple targets and several one-to-one relations. For one target with many sources, only a single example is created. This way, we augment our training data for the second NER model, as the sketch below illustrates.
        </p>
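        <p>A minimal sketch of this sample-generation step is given below (function and label names are ours, for illustration; token spans are end-exclusive). The numbered examples that follow correspond to its output:</p>
        <preformat>
def make_src_samples(tokens, target_heads, sources_by_target):
    """For each target, emit one copy of the sentence with a [T] marker
    inserted before the target head, plus IOB2 labels for its sources."""
    samples = []
    for head, sources in zip(target_heads, sources_by_target):
        marked = tokens[:head] + ["[T]"] + tokens[head:]
        labels = ["O"] * len(marked)
        for start, end in sources:
            for i in range(start, end):
                j = i + 1 if i >= head else i  # shift past the marker
                labels[j] = "B-RML" if i == start else "I-RML"
        samples.append((marked, labels))
    return samples

# "creatinina" (head 1) has source "1,5-2 mg/dL" at span (4, 6);
# "proteinuria" (head 7) has source "&lt; 1 g die" at span (9, 13).
sent = "La creatinina oscillava tra 1,5-2 mg/dL con proteinuria sempre &lt; 1 g die"
samples = make_src_samples(sent.split(), [1, 7], [[(4, 6)], [(9, 13)]])
        </preformat>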
        <p>
          As an example with two one-to-one relations, the sentence “La creatinina oscillava tra 1,5–2 mg/dL con proteinuria sempre &lt; 1 g die” has two targets, “creatinina” and “proteinuria”. The samples for our second NER model will be:
        </p>
        <p>(1) “La [T] creatinina oscillava tra 1,5–2 mg/dL con proteinuria sempre &lt; 1 g die”, with only “1,5–2 mg/dL” labeled as source;</p>
        <p>(2) “La creatinina oscillava tra 1,5–2 mg/dL con [T] proteinuria sempre &lt; 1 g die”, with only “&lt; 1 g die” labeled as source.</p>
        <p>
          In the following example, there is one source linked to three targets: “Gli esami colturali (germi comuni, BK) risultavano negativi.” The source is “negativi”, while the targets are “esami”, “germi” and “BK”. Three sentences will be added, all with a single source to be predicted (“negativi”):
        </p>
        <p>(3) “Gli [T] esami colturali (germi comuni, BK) risultavano negativi.”</p>
        <p>(4) “Gli esami colturali ([T] germi comuni, BK) risultavano negativi.”</p>
        <p>(5) “Gli esami colturali (germi comuni, [T] BK) risultavano negativi.”</p>
      </sec>
      <sec id="sec-aug">
        <title>2.3. Data augmentation</title>
        <p>
          The CLinkaRT training dataset consists of 83 Italian documents with 658 annotated relations. Due to the limited number of examples, we decide to augment the initial dataset with contextual word embeddings using the nlpaug library [<xref ref-type="bibr" rid="ref13">13</xref>]. For this process, we replace random words with other similar words in the embedding space, except for labeled tokens, since there is a risk of injecting noisy labels.
        </p>
        <p>
          We preserve the annotated entities and the target marker [T] in the augmented examples, ignoring sentences with 9 words or fewer. Augmented samples whose word count differs from the original are discarded, because the original labels would be misplaced.
        </p>
        <p>
          Given that one of our main concerns is finding laboratory tests and measurements, many of these entities are numeric values. To further our data augmentation, we introduce tiny changes of ±2 for integer values (age, year or quantities with a higher tolerance) and ±0.1 for real values (tests or percentages). In most cases, this process should not significantly alter the labeling.
        </p>
        <p>
          The training set has a much greater number of negative samples (examples without relations) than positive samples. We augment each example one or more times, with the number of augmented copies of positive instances denoted by in_multiplier and of negative instances by out_multiplier. We use the multipliers shown in Table 1, where the NER-tgt model predicts targets and the NER-src model predicts sources. The second model requires fewer auxiliary examples because the preprocessing step already created additional samples for sentences with more target entities.
        </p>
        <p>Table 1: Augmentation multipliers (in_multiplier, out_multiplier) for the NER-tgt and NER-src models.</p>
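        <p>A minimal sketch of the numeric perturbation described above (the regular expressions and the rounding are illustrative, not the exact implementation):</p>
        <preformat>
import random
import re

INT_RE = re.compile(r"^\d+$")
REAL_RE = re.compile(r"^\d+[.,]\d+$")

def perturb_number(token):
    """Shift integers by up to +/-2 and decimal numbers by up to +/-0.1;
    every other token is left untouched."""
    if INT_RE.match(token):
        return str(max(0, int(token) + random.randint(-2, 2)))
    if REAL_RE.match(token):
        sep = "," if "," in token else "."
        whole, frac = token.split(sep)
        value = float(f"{whole}.{frac}") + random.uniform(-0.1, 0.1)
        return f"{max(value, 0.0):.{len(frac)}f}".replace(".", sep)
    return token

# perturb_number("75")   -> e.g. "77"
# perturb_number("12,8") -> e.g. "12,7"
        </preformat>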
        <p>
          The bottleneck of this augmentation process is the library call that executes the transformation. Considering that the operation runs on GPU, it is natural to attempt to speed up this step by augmenting several examples in parallel. While the nlpaug library has an API that allows augmenting multiple sentences at once, and at first it appeared to work on a few samples in the train set, a significant number of augmented examples constructed by nlpaug turned out to be empty sentences, due to limitations or issues of this library. The batch implementation also required some effort because of the need to apply different multipliers. Since this attempted optimization did not succeed, we resumed augmenting examples one by one.
        </p>
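        <p>For reference, the per-example call looks roughly like the sketch below (an illustrative configuration, not the exact settings; the stopwords list stands in for the mechanism that shields labeled tokens and the [T] marker from substitution):</p>
        <preformat>
import nlpaug.augmenter.word as naw

# Contextual word-embedding substitution, one example at a time.
aug = naw.ContextualWordEmbsAug(
    model_path="IVN-RIN/medBIT-r3-plus",
    action="substitute",
    stopwords=["[T]", "creatinina", "1,5-2", "mg/dL"],  # protect labels
)

sentence = "La [T] creatinina oscillava tra 1,5-2 mg/dL"
augmented = aug.augment(sentence)  # a list in recent nlpaug versions
        </preformat>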
      </sec>
      <sec id="sec-train">
        <title>2.4. Model training and inference</title>
        <p>
          We implement our NER models as standard token classifiers with the help of the HuggingFace Transformers library [<xref ref-type="bibr" rid="ref14">14</xref>]. We perform fine-tuning on a model pretrained on Italian medical textbooks, web-crawled data and translated English PubMed abstracts, available as IVN-RIN/medBIT-r3-plus on the HuggingFace Hub [<xref ref-type="bibr" rid="ref3">3</xref>].
        </p>
        <p>
          All of our training experiments are carried out while mostly preserving the default parameters: the AdamW optimizer with a 5e-5 learning rate, linear decay and no warmup steps, 1e-2 weight decay, and 8 samples per batch, trained for 4 epochs, with 10% of the examples held out for validation. Both models are trained independently using gold labels, with the second NER model (NER-src) receiving the target marker tokens [T] from these gold annotations.
        </p>
        <p>
          In inference mode, the models are asked to output source and target offsets with respect to the original raw text. For each sentence converted into an example, we store the offset of its first token. Since the HuggingFace Datasets library employs a separate tokenization, we align the transformed concatenated sentences with the initial full texts by using spaCy [<xref ref-type="bibr" rid="ref15">15</xref>].
        </p>
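        <p>A simplified sketch of this offset bookkeeping follows (the Italian pipeline name is illustrative, and the actual alignment code differs in the details):</p>
        <preformat>
import spacy

nlp = spacy.load("it_core_news_sm")  # illustrative Italian pipeline

def sentences_with_offsets(text):
    """Split a report into sentences, keeping each sentence's character
    offset so spans predicted per sentence map back to the raw text."""
    doc = nlp(text)
    return [(sent.text, sent.start_char) for sent in doc.sents]

# A source predicted at characters (start, end) inside a sentence whose
# offset is k corresponds to (start + k, end + k) in the original text.
        </preformat>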
      </sec>
      <sec id="sec-results">
        <title>3. Results</title>
        <p>
          We conduct our experiments by creating a test set from the training set with a 90/10 split in order to simulate the final test set, switching to 10-fold cross-validation for selecting appropriate values for some of the parameters regarding training and augmentation. Although the models are trained at the sentence level, this split is made by document id, so we do not overestimate the model performance on unseen examples.
        </p>
        <p>
          The main results for relation extraction in the CLinkaRT task are displayed in Table 2. Our team obtains first place across all teams, with an F1-score of 62.99, an improvement of 6.5% over the second best competing team at 59.16 and a slight increase over the mBERT baseline of 62.83. We also achieve the highest recall among participants, 60.62, with the second best participant at 50.65, while the mBERT baseline has a recall of 64.37. We improve on the baseline precision of 61.37 by 6.8%, reaching a score of 65.55.
        </p>
        <p>
          We report the performance of our system on the validation set averaged across 10 folds together with the official results. We used cross-validation to carry out parameter and model selection. Besides the 10% of examples reserved for testing, the models also set aside 10% of the remaining examples for validation and hyperparameter tuning. In spite of these efforts, we notice a possible tendency to overfit. One explanation for this phenomenon is the small size of the model validation set, with too few samples to properly adjust the parameters during training. Another related explanation is the high variation between some of the folds: half of the folds obtain F1-scores over 82%, while the other folds consistently score lower, between 73% and 77%. Even so, it might simply be the case that the test set is intentionally constructed with novel situations to determine the performance on unseen data more accurately, which would justify the gap between test and validation.
        </p>
        <p>
          Outside the evaluation window, we repeated the inference process a second time on the test set, keeping the same parameters, and achieved a 64.09 F1-score, showing that our approach can outrun the other systems by a greater margin than in the official results. Still, this variation is caused by the nondeterministic nature of transformer networks. We plan to analyze the extent of this variation and to limit the randomness of our system.
        </p>
      </sec>
    <sec id="sec-2">
      <title>4. Discussion</title>
      <sec id="sec-2-1">
        <title>In this section, we present some observations regarding</title>
        <p>the design choices of our implementation and conduct
an error analysis.</p>
        <p>The data augmentation process has three main
parameters: the minimum number of words that should be
replaced in a sentence, min_aug, and the two multipli- (6) The true source is “pari 0 inferiori a 1.5 mg/dl”
ers for positive and negative examples defined earlier as and the predicted source is “inferiori a 1.5 mg/dl”.
in_multiplier and out_multiplier. We pick values Other similar examples: true source is “fino a 12.8
between 3 and 6 for min_aug, based on the number of mg/dL”, predicted source is “12.8 mg/dL”; true
failed replacements, fixing the value at 4 words. The source is “punte di circa 1200 pg/mL”, predicted
reasoning behind this decision is that the augmentation source is “circa 1200 pg/mL”.
library is sometimes unable to adequately generate valid
examples due to misplaced or missing words, so the gold (7) The true target is “antitrombina”, while the
prelabels cannot be applied, in turn leading to fewer exam- dicted target is “anti”
ples in the train set. The first situation appears due to modifying
compara</p>
        <p>The multipliers are selected by cross-validation, stop- tives not being present or being scarcely existent in the
ping early in case of unsatisfactory results on the first training data, which one could solve with additional
exfolds. For out_multiplier, we vary this parameter amples or through careful augmentation. The second
between 0 and 2 for both NER models, while for in_- issue seems to be a defect on our side that can be handled
multiplier we use values in the range 1–5 for NER-tgt in post-processing by inspecting the initial tokenization.
and 1–4 for NER-src. Our experiments confirm that aug- Another common mistake is the prediction of one
rementation is also needed for negative samples. This step lation instead of two (or vice versa) in the case of
interhas a significant impact in our system, boosting the score vals, which we explain by ambiguities in the training set.
on the validation set with over 20 percentage points in For example, our system outputs “1.9 – 2.5 mg/dl” linked
F1-score. As it would be expected, adding too many ex- with “creatininemia”, but there are two expected relations:
amples by using larger multipliers eventually induces “1.9” linked with “creatininemia” and “2.5 mg/dl” linked
overfit. with “creatininemia”. Conversely, our system detects two</p>
        <p>The main drawback of augmentation is the slow data relations, “sostanzialmente” linked with “obiettività” and
generation. As we mentioned earlier, nlpaug runs se- “nei limiti di norma” linked with “obiettività”, while there
quentially, so we had to make educated guesses of what is only one true relation, “sostanzialmente nei limiti di
combinations of parameters to include in our experi- norma” linked with “obiettività”.
ments. Lastly, a challenging facet of this task is the presence</p>
        <p>
          For most of our experiments, we rely on the model of reference values for some tests, which are picked up
called MedBIT-R3-plus [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] accessible on HuggingFace by our model, although they are not found as gold
laHub. In order to determine if this is the right choice, we bels because they do not represent test results. Future
briefly examine the efectiveness of other transformer work in this direction should find means to distinguish
models. We consider three alternative options: Italian between reference values and actual measurements and
BERT [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], the multilingual version of DistilBERT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] test values.
and the BioBIT model trained only on medical textbooks
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Undoubtedly, Italian BERT is less suitable for this
task, with a substantial drop in F1-score of 14 percentage
        </p>
      </sec>
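      <p>The post-processing fix suggested earlier for truncated predictions such as “anti” could look like the following sketch (a simple whitespace heuristic; inspecting the actual tokenization would be more careful):</p>
      <preformat>
def expand_to_word(text, start, end):
    """Widen a predicted character span to the enclosing whitespace-
    delimited word, e.g. recovering "antitrombina" from "anti"."""
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    while end &lt; len(text) and not text[end].isspace():
        end += 1
    return start, end

# expand_to_word("dosaggio antitrombina ridotto", 9, 13) -> (9, 21)
      </preformat>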
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and future work</title>
      <sec id="sec-3-1">
        <title>In this paper, we detailed our contribution in the</title>
        <p>
          CLinkaRT task [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] at EVALITA 2023 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], demonstrating
that intuitive solutions yield competitive results. Our
proposed approach achieves the best F1-score among the participating systems in the task of linking laboratory tests and measurements with their results, with a 6.5% improvement over the second best contestant.
        </p>
        <p>We present a straightforward strategy to extract
sentence-level relations based on two plain NER
models, illustrating the learning capabilities of transformer
networks to solve challenging tasks with the help of
special tokens. We intend to further explore this direction
since NER models are well established and usually
require fewer resources than alternative relation extraction
(RE) models. The presented method is not limited to
the clinical domain and it can be easily applied in other
contexts, with the added benefit of shorter development
cycles. In certain domains and applications, the overhead
of a generic RE model may be unjustified if the relations
in question are simple enough.</p>
        <p>Data augmentation is a valuable, but underused
technique in natural language processing contexts. We look
forward to enhancing the augmentation procedure to
account for in-domain information. Another area we
believe to be worth pursuing is the handling of numeric
values and ranges, either by finding a way to inject fuzzy
intervals or by masking these values altogether, therefore
simplifying the initial problem.</p>
        <p>Our system implementation is available at https://gitlab.com/marius.micluta-campeanu/testlink-clinkart-2023 to encourage an open environment for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615-3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19-1371.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models, 2022. arXiv:2212.10422.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] C. P. Carrino, J. Armengol-Estapé, A. Gutiérrez-Fandiño, J. Llop-Palao, M. Pàmies, A. Gonzalez-Agirre, M. Villegas, Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario, 2021. arXiv:2109.03570.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H. Türkmen, O. Dikenelli, C. Eraslan, M. C. Çalli, S. S. Ozbek, Developing Pretrained Language Models for Turkish Biomedical Domain, in: 2022 IEEE 10th International Conference on Healthcare Informatics (ICHI), IEEE, 2022, pp. 597-598.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Y. Labrak, A. Bazoge, R. Dufour, M. Rouvier, E. Morin, B. Daille, P.-A. Gourraud, DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 16207-16221. URL: https://aclanthology.org/2023.acl-long.896.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] B. Altuna, G. Karunakaran, A. Lavelli, B. Magnini, M. Speranza, R. Zanoli, CLinkaRT at EVALITA 2023: Overview of the Task on Linking a Lab Result to its Test Event in the Clinical Domain, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, H. Liu, Clinical information extraction applications: A literature review, Journal of Biomedical Informatics 77 (2018) 34-49. URL: https://www.sciencedirect.com/science/article/pii/S1532046417302563. doi:10.1016/j.jbi.2017.11.011.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] N. Perera, M. Dehmer, F. Emmert-Streib, Named Entity Recognition and Relation Detection for Biomedical Information Extraction, Frontiers in Cell and Developmental Biology 8 (2020). URL: https://www.frontiersin.org/articles/10.3389/fcell.2020.00673. doi:10.3389/fcell.2020.00673.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] B. Magnini, B. Altuna, A. Lavelli, M. Speranza, R. Zanoli, The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases, in: J. Monti, F. Dell'Orletta, F. Tamburini (Eds.), Proceedings of the Seventh Italian Conference on Computational Linguistics, volume 2769 of CLiC-it, CEUR-WS, Milan, Italy, 2020, pp. 422-431.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] B. Altuna, R. Agerri, L. Salas-Espejo, J. J. Saiz, A. Lavelli, B. Magnini, M. Speranza, R. Zanoli, G. Karunakaran, Overview of TESTLINK at IberLEF 2023: Linking Results to Clinical Laboratory Tests and Measurements, Procesamiento del Lenguaje Natural 71 (2023).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] E. Ma, NLP Augmentation, https://github.com/makcedward/nlpaug, 2019.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38-45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] I. Montani, M. Honnibal, S. V. Landeghem, A. Boyd, H. Peters, P. O. McCann, jim geovedi, J. O'Regan, M. Samsonov, G. Orosz, D. de Kok, M. Blättermann, D. Altinok, S. L. Kristiansen, M. Kannan, R. Mitsch, R. Bournhonesque, Edward, L. Miranda, P. Baumgartner, R. Hudson, E. Bot, Roman, L. Fiedler, R. Daniels, W. Phatthiyaphaibun, G. Howard, Y. Tamura, spaCy: Industrial-strength Natural Language Processing in Python, 2023. URL: https://doi.org/10.5281/zenodo.7715077. doi:10.5281/zenodo.7715077.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, ArXiv abs/1910.01108 (2019).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>