=Paper=
{{Paper
|id=Vol-3740/paper-11
|storemode=property
|title=Team NOVA LINCS @ BIOASQ12 MultiCardioNER Track: Entity Recognition with Additional Entity Types
|pdfUrl=https://ceur-ws.org/Vol-3740/paper-11.pdf
|volume=Vol-3740
|authors=Rodrigo Gonçalves,André Lamúrias
|dblpUrl=https://dblp.org/rec/conf/clef/GoncalvesL24
}}
==Team NOVA LINCS @ BIOASQ12 MultiCardioNER Track: Entity Recognition with Additional Entity Types==
Rodrigo Gonçalves1,* , André Lamúrias1
1 NOVA LINCS, NOVA School of Science and Technology, Lisbon, Portugal
Abstract
This paper presents the contribution of the NOVA LINCS team to the task MultiCardioNER at BioASQ12. BioASQ
is a long-running challenge that focuses mostly on biomedical semantic indexing and question answering (QA).
The specific task of MultiCardioNER focuses on the multilingual adaptation of clinical NER systems to the
cardiology domain. We leverage a state-of-the-art Spanish pre-trained model to perform NER on the DisTEMIST
and DrugTEMIST datasets provided by this task, aided by the additional clinical cases corpus, CardioCCC, for
validation. Experiments were carried out both with only the entity type to which each of those datasets refers, and with the additional entity types from other datasets that annotate the same documents, in order to determine whether any advantage could be obtained by leveraging knowledge of the additional entities. The models trained on the combined dataset achieved a very slight, non-significant boost in F1-score compared to their one-entity counterparts, with all runs suffering from low recall due to pre- and post-processing errors. However,
one run achieved the highest precision of the task (0.9242). Code to reproduce our submission is available at
https://github.com/Rodrigo1771/BioASQ12-MultiCardioNER-NOVALINCS.
Keywords
Cardiology, Clinical Cases, Named Entity Recognition, Language Models, Transfer Learning
1. Introduction
The MultiCardioNER [1] task, part of the BioASQ 2024 challenge [2], focuses on the automatic recogni-
tion of two key clinical concept types: diseases and medications. In this task, the extraction of those
types of entities is specifically performed over cardiology clinical case documents. This represents a
worthwhile effort: with cardiovascular diseases being the world's leading cause of death, it is imperative that better automatic semantic annotation resources and systems are developed for high-impact clinical domains such as cardiology. Furthermore, this task also supports the
development of systems that can perform in multiple languages, not just English, which is much needed
as the prevalence and impact of cardiovascular diseases are global, requiring robust and adaptable tools
that can cater to diverse linguistic and clinical environments.
The datasets provided by the two MultiCardioNER subtasks, namely DisTEMIST and DrugTEMIST, both annotate the same set of documents, the SPACCC corpus [3].
Additionally, this corpus is also used by both MedProcNER [4] and SympTEMIST [5]. What this means
in practice is that the SPACCC corpus is annotated with four different types of entities: diseases for
DisTEMIST, medications for DrugTEMIST, medical procedures for MedProcNER, and symptoms for
SympTEMIST. Thus, we focused on training two models for each subtask: one with only the entity type of that subtask, and another with all four entity types. That way, we could evaluate the
benefit, or lack thereof, of having annotations regarding additional entity types. For the second subtask,
we considered only the documents in Spanish.
2. Data
This shared task utilizes two main training datasets, as previously mentioned: the DisTEMIST and
DrugTEMIST datasets. Each of these corpora is composed of the same 1000 documents, which belong to the Spanish Clinical Case Corpus (SPACCC).
CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
rmg.goncalves@campus.fct.unl.pt (R. Gonçalves); a.lamurias@fct.unl.pt (A. Lamúrias)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Table 1
Statistics of the unprocessed dataset.
Dataset Number of examples/sentences Number of annotated entities
DisTEMIST 16160 10664
DrugTEMIST 16160 2778
SympTEMIST 16160 12196
MedProcNER 16160 14684
CardioCCC-DisTEMIST 17869 10348
CardioCCC-DrugTEMIST 17869 2510
Table 2
Statistics of the processed dataset (the final training and validation files).
Dataset Number of examples/sentences Number of annotated entities
CombinedDataset-Train 27223 39805
DisTEMIST-Train 27223 16675
DrugTEMIST-Train 27223 4199
DisTEMIST-Dev 6806 4262
DrugTEMIST-Dev 6806 1088
SPACCC is a manually classified collection of clinical case reports written in Spanish. While DisTEMIST focuses on extracting disease mentions, having ENFERMEDAD as its sole label, DrugTEMIST focuses on drug and medication
extraction, with FARMACO being its entity type. As previously mentioned, two earlier shared tasks, MedProcNER and SympTEMIST, also make use of the SPACCC corpus to extract medical
procedures and symptoms, having PROCEDIMIENTO and SINTOMA as their entity types, respectively.
Additionally, the task organizers provided a separate dataset of cardiology clinical case reports,
CardioCCC, to be used for the domain adaptation part of the task. It contains a total of 508 documents,
split into 258 documents initially intended for validation, and 250 for testing. As expected, these
documents only include annotations for disease and medication mentions, the entity types considered
for this competition. Nevertheless, we used this dataset both for training and validation of the model.
Some statistics on all of these datasets are shown in Table 1.
After combining the four mentioned datasets (apart from CardioCCC) into a single dataset that
includes the annotations of all four entity types, and given the task objective of adapting clinical models
to the cardiology domain specifically, we joined this new combined dataset with CardioCCC, and split
it so that 80% of examples (sentences) were used for training and 20% for validation. This way, we can
include cardiology clinical case reports in the training of the model. Table 2 shows some statistics on
every training and validation set.
Note that the three training sets contain exactly the same example sentences; likewise, the two validation sets contain the same sentences. The only difference between them is which entity types are annotated in each example.
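As an illustration of this preprocessing (not the exact scripts in our repository), the following sketch merges per-corpus BIO tags over the same sentences and performs the 80/20 split; the function and variable names are hypothetical.

```python
import random

def merge_tag_sets(sentences_by_corpus):
    """Merge per-corpus BIO tags over the same sentences into one tag sequence.

    `sentences_by_corpus` maps a corpus name (e.g. "distemist") to a list of
    (tokens, tags) pairs; all corpora are assumed to cover the same sentences
    in the same order."""
    corpora = list(sentences_by_corpus.values())
    merged = []
    for aligned in zip(*corpora):
        tokens = aligned[0][0]
        # Start from all-"O" labels and overwrite with any non-"O" tag found
        # in one of the corpora (entity types rarely overlap on the same token).
        tags = ["O"] * len(tokens)
        for _, corpus_tags in aligned:
            for i, tag in enumerate(corpus_tags):
                if tag != "O":
                    tags[i] = tag
        merged.append((tokens, tags))
    return merged

def train_dev_split(examples, train_fraction=0.8, seed=42):
    """Shuffle sentence-level examples and split them 80/20 into train/dev."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Usage sketch: merge the four SPACCC-based corpora, append the CardioCCC
# sentences, and split everything 80/20 (corpus loading omitted).
# combined = merge_tag_sets({"distemist": dis, "drugtemist": drug,
#                            "symptemist": symp, "medprocner": proc})
# train, dev = train_dev_split(combined + cardioccc)
```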
3. Methodology and Experiments
For this task, we leveraged the capabilities of the pre-trained model bsc-bio-ehr-es1 [6], a RoBERTa-
based model [7] trained on a biomedical-clinical corpus in Spanish collected from several sources
totalling more than 1 billion tokens. This model has previously been fine-tuned on similar shared tasks
like CANTEMIST [8] and PharmaCoNER [9], achieving remarkable results.
We conducted two experiments. The first experiment compares the model when fine-tuned on the
1 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es
Table 3
Internal validation set results (not obtained with the official evaluation library; calculated from the token labels).
Run Precision Recall F1-Score
1_bsc-bio-ehr-es_distemist_4 0.7795 0.8278 0.8029
2_bsc-bio-ehr-es_distemist_1 0.8067 0.8067 0.8067
3_bsc-bio-ehr-es_drugtemist_4 0.9348 0.9485 0.9416
4_bsc-bio-ehr-es_drugtemist_1 0.9430 0.9586 0.9508
combined dataset with the four entities against the model when fine-tuned on just the DisTEMIST
dataset. The second experiment also compares the model when fine-tuned on the combined dataset, but
this time against the model fine-tuned on just the DrugTEMIST dataset. This brings the total number of submitted runs to four:
• Track 1 (DisTEMIST):
– 1_bsc-bio-ehr-es_distemist_4: bsc-bio-ehr-es trained on the combined dataset
with the 4 entity types (ENFERMEDAD, FARMACO, PROCEDIMIENTO, SINTOMA), and
validated on the validation set for DisTEMIST (only with ENFERMEDAD).
– 2_bsc-bio-ehr-es_distemist_1: trained and validated on the training and validation
sets, respectively, for DisTEMIST (only with ENFERMEDAD).
• Track 2 (DrugTEMIST), Spanish subset:
– 3_bsc-bio-ehr-es_drugtemist_4: bsc-bio-ehr-es trained on the combined dataset
with the 4 entity types (ENFERMEDAD, FARMACO, PROCEDIMIENTO, SINTOMA), and
validated on the validation set for DrugTEMIST (only with FARMACO).
– 4_bsc-bio-ehr-es_drugtemist_1: trained and validated on the training and validation
sets, respectively, for DrugTEMIST (only with FARMACO).
During training, the model was evaluated on the validation set after each epoch and the best checkpoint was kept (a minimal fine-tuning sketch is given after the list below). The hyperparameters for all four runs were the following:
• Learning rate: 5e-05
• Total train batch size: 16
• Epochs: 10
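A minimal fine-tuning sketch with these settings, using the Hugging Face transformers library, is shown below. The toy dataset, label list, and token-level metric are illustrative placeholders; the actual training scripts are available in the linked repository.

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "PlanTL-GOB-ES/bsc-bio-ehr-es"
# Label set for the combined-dataset runs; the one-entity runs keep only "O"
# and the B-/I- labels of their own track. "O" must have id 0 for the simple
# token-level metric below.
LABELS = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD", "B-FARMACO", "I-FARMACO",
          "B-PROCEDIMIENTO", "I-PROCEDIMIENTO", "B-SINTOMA", "I-SINTOMA"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME,
                                                        num_labels=len(LABELS))

def encode(example):
    """Tokenize a pre-split sentence and align BIO label ids to sub-tokens."""
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if w is None else example["tags"][w]
                     for w in enc.word_ids()]
    return enc

# Toy one-sentence dataset; the real runs use the sentence-level corpora above.
toy = Dataset.from_dict({
    "tokens": [["Paciente", "con", "hipertensión", "pulmonar", "severa"]],
    "tags":   [[0, 0, 1, 2, 2]],   # O O B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD
}).map(encode, remove_columns=["tokens", "tags"])

def compute_metrics(eval_pred):
    """Token-level micro precision/recall/F1, ignoring padding (-100) positions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100
    tp = float(((preds == labels) & (labels != 0) & mask).sum())
    pred_pos = float(((preds != 0) & mask).sum())
    gold_pos = float(((labels != 0) & mask).sum())
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

args = TrainingArguments(
    output_dir="out",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    evaluation_strategy="epoch",    # evaluate on the validation set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep the best checkpoint
    metric_for_best_model="f1",
)

trainer = Trainer(model=model, args=args, train_dataset=toy, eval_dataset=toy,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  compute_metrics=compute_metrics)
trainer.train()
```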
4. Results and Discussion
The results for each run obtained on our validation set, both during the development of our approach and after submitting it (the latter obtained using the official evaluation library2, after it was released), are presented in Tables 3 and 4, respectively. Additionally, the official gold standard test set results are shown in Table 5. The naming convention for each run is {run_id}_{model_name}_{dataset}_{number_of_entity_types_in_training}.
First of all, it is important to point out how we obtained our validation set results (Table 3) and the final submitted predictions for the runs where the model was trained on the combined dataset (runs 1 and 3), as those models naturally predict more entity types than the one relevant for the track. For those runs, after obtaining the predictions (whether the final test set predictions or the predictions used to calculate the validation results), any prediction whose entity type did not match the track's entity type was ignored, so that only predictions of the entity type relevant to the track were considered.
2 https://github.com/nlp4bia-bsc/multicardioner_evaluation_library
Table 4
Official validation set results (obtained using the official evaluation library, calculated with the entity
labels).
Run Precision Recall F1-Score
1_bsc-bio-ehr-es_distemist_4 0.7436 0.4308 0.5455
2_bsc-bio-ehr-es_distemist_1 0.7769 0.4305 0.554
3_bsc-bio-ehr-es_drugtemist_4 0.8762 0.5919 0.7065
4_bsc-bio-ehr-es_drugtemist_1 0.8812 0.6002 0.7141
Table 5
Official test set results.
Run Precision Recall F1-Score
1_bsc-bio-ehr-es_distemist_4 0.8018 0.3525 0.4897
2_bsc-bio-ehr-es_distemist_1 0.8183 0.3398 0.4802
3_bsc-bio-ehr-es_drugtemist_4 0.9242 0.4965 0.646
4_bsc-bio-ehr-es_drugtemist_1 0.9076 0.4919 0.638
For example, in run 1, the predictions related to the entity types FARMACO, PROCEDIMIENTO and SINTOMA were ignored, and only ENFERMEDAD predictions were included in the final metrics shown in Table 3 and in the submitted predictions. Likewise, in run 3, the predictions related to the entity types ENFERMEDAD, PROCEDIMIENTO and SINTOMA were ignored and only FARMACO was considered.
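Concretely, this post-processing amounts to discarding every predicted mention whose type differs from the track's target type. A minimal sketch, assuming predictions are stored as dictionaries with document-level offsets (the field names and example values are hypothetical, not those of our actual scripts):

```python
def keep_track_entities(predictions, target_type):
    """Keep only predicted mentions whose type matches the track's entity type.

    `predictions` is a list of dicts with (hypothetical) fields
    doc_id / start / end / type / text."""
    return [p for p in predictions if p["type"] == target_type]

# Example: a combined-dataset model predicts several entity types, but only
# the track's type is kept for scoring and submission.
preds = [
    {"doc_id": "d1", "start": 10, "end": 31, "type": "ENFERMEDAD", "text": "hipertensión pulmonar"},
    {"doc_id": "d1", "start": 50, "end": 60, "type": "FARMACO",    "text": "amiodarona"},
    {"doc_id": "d1", "start": 70, "end": 76, "type": "SINTOMA",    "text": "disnea"},
]
print(keep_track_entities(preds, "ENFERMEDAD"))   # run 1: only the ENFERMEDAD mention
print(keep_track_entities(preds, "FARMACO"))      # run 3: only the FARMACO mention
```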
The difference between the recall scores on our internal evaluation and on the official evaluation of the runs is striking. A combination of high precision and low recall usually means that the system retrieves only a small portion of the relevant entities, but when it does retrieve one, it identifies it correctly the vast majority of the time. Our first hypothesis
to try to explain what happened was that this was due to our use of the BIO tagging schema. In this
tagging schema, a sequence containing an entity, for example, of type ENFERMEDAD like "hipertensión
pulmonar severa", should be classified as "B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD". Our
reasoning was that the model could not generalize to entities that were not present in the training data, especially since we picked the checkpoint with the best token classification score and observed some overfitting when comparing the training and validation losses, shown in Figure 1. This could lead to the model under- or over-classifying entities:
• Under-classification of "hipertensión pulmonar severa": "B-ENFERMEDAD I-ENFERMEDAD O" (missing "severa").
• Over-classification of "estenosis píloro - duodenal de carácter extrínseco": should be classified as "B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD O O O" but is over-classified as "B-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD I-ENFERMEDAD" (including the additional information "de carácter extrínseco").
This under- and over-classification could lead to low recall, since the correct entities are not extracted and the number of false negatives grows. However, it would also lead to low precision, as extracting incomplete entities produces additional false positives, yet our models achieved good and even excellent precision. To achieve high precision and poor recall, our models would have to extract a small number of entities, so as not to increase the number of false positives, and this is indeed what they do, as shown in Tables 6 and 7.
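The following sketch (not part of our pipeline) illustrates why a partial extraction hurts both precision and recall under strict, entity-level matching, whereas token-level scoring would still credit the correctly labelled tokens:

```python
def entities_from_bio(tags):
    """Extract (start_token, end_token, type) spans from a BIO tag sequence.
    (Simplified: an I- tag is assumed to continue the currently open span.)"""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel closes the last span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return set(spans)

gold = ["B-ENFERMEDAD", "I-ENFERMEDAD", "I-ENFERMEDAD"]   # "hipertensión pulmonar severa"
pred = ["B-ENFERMEDAD", "I-ENFERMEDAD", "O"]              # under-classified: "severa" missed

gold_spans, pred_spans = entities_from_bio(gold), entities_from_bio(pred)
tp = len(gold_spans & pred_spans)    # 0: the partial span matches no gold span exactly
fp = len(pred_spans - gold_spans)    # 1: the incomplete mention counts as a false positive
fn = len(gold_spans - pred_spans)    # 1: the full gold mention counts as a false negative
print(tp, fp, fn)
```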
Then, through further error analysis, we realized that this might instead be due to our data parsing approach.
Figure 1: Training (blue) and validation (green) losses for each run. The red dashed line indicates the epoch in
which the model achieved the best F1-Score during the training and validation phase, and therefore represents
the final chosen checkpoint.
Table 6
The number of predictions (extracted entities) and number of gold standard annotations (entities that should be
extracted) by run, when evaluated on the validation set using the official evaluation library.
Run Number of Predictions Number of Gold Standard Annotations
1_bsc-bio-ehr-es_distemist_4 2469 4262
2_bsc-bio-ehr-es_distemist_1 2362 4262
3_bsc-bio-ehr-es_drugtemist_4 735 1088
4_bsc-bio-ehr-es_drugtemist_1 741 1088
Table 7
The number of predictions (extracted entities) and number of gold standard annotations (entities that should be
extracted) by run, when evaluated on the test set.
Run Number of Predictions Number of Gold Standard Annotations
1_bsc-bio-ehr-es_distemist_4 3466 7884
2_bsc-bio-ehr-es_distemist_1 3274 7884
3_bsc-bio-ehr-es_drugtemist_4 923 1718
4_bsc-bio-ehr-es_drugtemist_1 931 1718
Because we parsed the datasets with the same approach used by PlanTL-GOB-ES when parsing CANTEMIST3 and PharmaCoNER4 for NER in order to fine-tune the same model that we used [6], the model was not only trained, but also locally evaluated, on individual sentences rather than entire documents. This differs from the way the official evaluation library evaluates runs.
3 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es-cantemist
4 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es-pharmaconer
Table 8
The average start span of positive entities and of false negative entities, by run, when evaluated on the validation
set using the official evaluation library.
Run Avg. Start Span Pos. Entities Avg. Start Span False Neg. Entities
1_bsc-bio-ehr-es_distemist_4 1015 3908
2_bsc-bio-ehr-es_distemist_1 996 3910
3_bsc-bio-ehr-es_drugtemist_4 1247 4807
4_bsc-bio-ehr-es_drugtemist_1 1250 4889
Table 9
The average start span of positive entities and of false negative entities, by run, when evaluated on the test set.
Run Avg. Start Span Pos. Entities Avg. Start Span False Neg. Entities
1_bsc-bio-ehr-es_distemist_4 1028 4767
2_bsc-bio-ehr-es_distemist_1 990 4706
3_bsc-bio-ehr-es_drugtemist_4 1084 5577
4_bsc-bio-ehr-es_drugtemist_1 1088 5538
The official evaluation expects the models to predict on the full content of each test document, making the spans for each extracted entity relative to the whole document, not just to individual sentences.
This conflict of approaches might explain the discrepancy between the quality of the results, par-
ticularly the recall, obtained on our validation set with our evaluation method, and the quality of the
results obtained with the official evaluation library, both on our validation set and on the test set: as
just mentioned, the examples that were fed to the model during training were individual sentences,
meaning that they were overall much shorter (average length of 21 tokens, 124 chars) than the ones
fed to the model when evaluating it on the test set (entire documents, average length of 1013 tokens,
5754 chars). Thus, the small size of the training examples might have induced the model to extract entities only up to a certain offset when fed longer examples (i.e., documents). Furthermore, and possibly more relevantly, the maximum input sequence length of this model is 512 tokens, which means that when
obtaining the test set predictions to submit, with the entire documents as input, the tokens that went
over this limit were not classified by our models.
These two reasons explain the low number of entities extracted, as the majority of the misses occur
near the end of the documents. On the other hand, the results using our evaluation method, during
the model’s training, did not show this low recall because the model was also evaluated on individual
sentences. The information presented in Tables 8 and 9 supports these claims, as the average start span of (true and false) positives, i.e., retrieved entities, is much lower than the average start span of false negatives, i.e., relevant but not retrieved entities.
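Such an analysis can be computed with a few lines, assuming gold and predicted mentions are represented as (doc_id, start, end, type) tuples with document-level character offsets; the sketch below is illustrative rather than the exact script we used.

```python
def average_start_offsets(gold, pred):
    """Mean document-level start offset of retrieved mentions vs. missed gold mentions.

    `gold` and `pred` are sets of (doc_id, start, end, type) tuples with
    character offsets relative to the whole document."""
    positives = list(pred)                                # everything retrieved (TP + FP)
    false_negatives = [g for g in gold if g not in pred]  # relevant but not retrieved
    mean_start = lambda spans: sum(s[1] for s in spans) / len(spans) if spans else 0.0
    return mean_start(positives), mean_start(false_negatives)

# avg_positive_start, avg_false_negative_start = average_start_offsets(gold_mentions, predicted_mentions)
```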
We believe this is why our approach showed consistent results across the board on our internal validation but failed to replicate them on the official testing: the maximum input length of the model is shorter than many of the documents it is evaluated on, with the additional nuance that the difference in length between the training and test examples (the former much shorter than the latter) may have induced the model to only predict entities up to a certain offset. We plan to re-classify the dev and test sets using sentences instead of full documents, in order to verify whether the recall improves and becomes closer to our internal evaluation (a sketch of this procedure is shown below).
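As an illustration of this planned fix (not the submitted pipeline), the sketch below splits a document into sentences, runs a fine-tuned model on each sentence with the transformers token-classification pipeline, and shifts every predicted span back to document-level offsets, which also keeps each input well under the 512-token limit; the model path and the naive sentence splitter are placeholders.

```python
from transformers import pipeline

# "finetuned-checkpoint" is a placeholder for a fine-tuned model directory,
# not the actual path used in our repository.
ner = pipeline("token-classification", model="finetuned-checkpoint",
               aggregation_strategy="simple")

def split_sentences(document):
    """Very naive sentence splitter returning (sentence, start_offset) pairs."""
    sentences, offset = [], 0
    for sent in document.split(". "):
        sentences.append((sent, offset))
        offset += len(sent) + 2            # account for the removed ". "
    return sentences

def predict_document(document):
    """Run NER sentence by sentence and shift spans to document-level offsets."""
    mentions = []
    for sent, sent_start in split_sentences(document):
        for ent in ner(sent):
            mentions.append({
                "start": sent_start + ent["start"],   # document-relative offsets
                "end": sent_start + ent["end"],
                "type": ent["entity_group"],
                "text": ent["word"],
            })
    return mentions
```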
Nevertheless, focusing on the objective of the experiments themselves, we observe that training with the combined dataset did not yield any substantial improvement in F1-Score. In fact, both tasks showed a small disadvantage on the validation set: for task 1, a drop of 0.38pp on our
evaluation method and of 0.85pp using the official evaluation library, and on task 2 a drop of 0.92pp
for our evaluation method and of 0.76pp using the official library. On the test set, while the precision
for all runs is quite good, even achieving the best precision score for any team on task 2 with run 3
(0.9242), the recall achieved by all runs is very low, which results in poor F1-scores, more specifically
20.64pp and 16.62pp below the mean for tasks 1 and 2 respectively, for our best run from each task.
Furthermore, when checking for the main objective of the experiments, we observe again no substantial
advantage for training with the combined dataset, only increasing the F1-Score by 0.95pp for task 1 and
0.80pp for task 2.
5. Conclusion and Future Work
We employ the state-of-the-art transformer model bsc-bio-ehr-es and explore the hypothesis that training with additional entity types within the biomedical domain is beneficial for extracting a specific type; the results indicate that it does not make a significant difference. Furthermore, with the scores
obtained on the test set, we show that this model can achieve results with high precision and with
minimal fine-tuning. However, the model’s poor recall on the test set is noteworthy and likely caused
by the way in which the data is parsed for training and evaluation.
In the future, we intend to perform hyperparameter optimization on this model, to try to improve
the scores that we obtained for the MultiCardioNER task (both on the DisTEMIST and DrugTEMIST
subtracks), and apply English and Italian biomedical pre-trained language models to the corresponding
subsets of the DrugTEMIST dataset.
6. Acknowledgements
This work is supported by NOVA LINCS ref. UIDB/04516/2020
(https://doi.org/10.54499/UIDB/04516/2020) and ref. UIDP/04516/2020
(https://doi.org/10.54499/UIDP/04516/2020) with the financial support of FCT.IP.
References
[1] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz,
G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger,
Overview of MultiCardioNER task at BioASQ 2024 on Medical Speciality and Language Adaptation
of Clinical NER Systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková,
A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the
Evaluation Forum, 2024.
[2] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger,
N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The
twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering,
in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková,
A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF
Association (CLEF 2024), 2024.
[3] A. Intxaurrondo, M. Krallinger, SPACCC, 2018. URL: https://zenodo.org/doi/10.5281/zenodo.1563762.
doi:10.5281/ZENODO.1563762.
[4] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023, Working Notes of CLEF (2023).
[5] S. Lima-López, E. Farré-Maduell, L. Gasco-Sánchez, J. Rodríguez-Miret, M. Krallinger, Overview of
SympTEMIST at BioCreative VIII: corpus, guidelines and evaluation of systems for the detection
and normalization of symptoms, signs and findings from text, 2023. URL: https://zenodo.org/doi/10.
5281/zenodo.10104547. doi:10.5281/ZENODO.10104547.
[6] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo,
A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained Biomedical Language Models for Clin-
ical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language Pro-
cessing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL:
https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. URL: http://arxiv.org/abs/1907.
11692, arXiv:1907.11692 [cs].
[8] A. Miranda-Escalada, E. Farré-Maduell, M. Krallinger, Named Entity Recognition, Concept Normal-
ization and Clinical Coding: Overview of the Cantemist Track for Cancer Text Mining in Spanish,
Corpus, Guidelines, Methods and Results, 2020. doi:10.5281/zenodo.3773228.
[9] A. G. Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger, PharmaCoNER:
Pharmacological Substances, Compounds and proteins Named Entity Recognition track, in: Pro-
ceedings of The 5th Workshop on BioNLP Open Shared Tasks, Association for Computational
Linguistics, Hong Kong, China, 2019, pp. 1–10. URL: https://www.aclweb.org/anthology/D19-5701.
doi:10.18653/v1/D19-5701.