SINAI at CLEF 2022: Leveraging biomedical
transformers to detect and normalize disease
mentions
Mariia Chizhikova1, Jaime Collado-Montañez1, Pilar López-Úbeda2,
Manuel C. Díaz-Galiano1, L. Alfonso Ureña-López1 and M. Teresa Martín-Valdivia1
1 University of Jaén, Campus Las Lagunillas s/n, 23071, Jaén, Spain
2 R+D+I department, HT medica, Carmelo Torres nº2, 23007, Jaén, Spain


                                         Abstract
                                         This paper presents the system developed by the SINAI team for the Disease Text Mining and Indexing Shared
                                         Task at the CLEF 2022 BioASQ workshop. This task brings together community efforts to develop automatic
                                         disease mention detection and semantic indexing systems for electronic health records written in Spanish.
                                         Our proposal is based on a deep learning RoBERTa model fine-tuned for the named entity
                                         recognition task, which achieved a micro-average F1-score of 0.74 during the official evaluation. For the
                                         entity linking task, we introduce a system based on a combination of term matching and embedding
                                         similarity calculation, whose best micro-average F1-score is 0.41.

                                         Keywords
                                         Clinical entity recognition, Clinical entity linking, Biomedical Natural Language Processing, RoBERTa
                                         language model


1. Introduction
Clinical coding is the transformation of medical information from a patient’s Electronic Health
Records (EHR) into alphanumeric codes using a classification standard [1]. At present, the
interpretation of EHRs and the assignment of standardised codes rely on human coders or even
on physicians themselves. However, the massive amount of medical data, which grows with
each patient’s visit, has become an excessive burden for human annotators [2]. This has led to a
rise in demand for the development and improvement of automatic curation systems capable of
handling massive amounts of EHRs.
   Natural Language Processing (NLP) aims to address the need to manage unstructured data
in order to extract relevant information that makes Information Retrieval (IR) more efficient
or that can serve as input for applications such as Clinical Decision Support Systems (CDSS) [3].
   Search queries that mention disorders (diseases, abnormalities, dysfunctions,
syndromes, injuries, etc.) constitute the most frequent non-bibliographic query type among

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
mc000051@red.ujaen.es (M. Chizhikova); jcollado@ujaen.es (J. Collado-Montañez); p.lopez@htmedica.com
(P. López-Úbeda); mcdiaz@ujaen.es (M. C. Díaz-Galiano); laurena@ujaen.es (L. A. Ureña-López); maite@ujaen.es
(M. T. Martín-Valdivia)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
PubMed users [4]. This category is also highly relevant in clinical texts, which
emphasises the need to create accurate Named Entity Recognition (NER) and Named Entity
Normalization (NEN) systems to improve information retrieval. Such systems would
automatically detect disorder mentions in both scientific and clinical texts and subsequently map
them to codes in a relevant controlled vocabulary.
   With Bidirectional Encoder Representations from Transformers (BERT) [5], large pre-trained
neural language models based on the transformer architecture became an essential building block for many
NLP tasks, such as text classification, NER and text summarization. Nevertheless, the transfer
learning capacity of these models depends, among many other factors, on the differences in language
variety between the pre-training corpora and the task data. Considering that the vocabulary
and syntax of medical jargon differ from general-domain language, continual pre-training of
a general-domain model was proposed as a way of improving its performance in biomedical
NLP [6]. Despite being beneficial, continual pre-training does not extend the vocabulary of the
original model, which remains unrepresentative of domain-specific texts. This led to the
proposal of domain-specific pre-training from scratch, which proved to be more effective than
continual pre-training [7].
   The Disease Text Mining and Indexing Shared Task (DISTEMIST) at the CLEF 2022 BioASQ workshop
brings together community efforts to design systems capable of making disorder mentions in clinical
text accessible to search systems by identifying them and mapping each one to a code from
the Systematized Nomenclature of Medicine – Clinical Terms (Snomed-CT)1. Snomed-CT is an
integral multilingual clinical terminology that contains almost 800,000 descriptions, including
synonyms that can be used to refer to a concept, linked by semantic relationships.
Moreover, Snomed-CT is described as the most comprehensive clinical healthcare terminology in the
world2.
   In this paper we describe the approach followed by the SINAI team to tackle both the NER
and NEN DISTEMIST subtasks. The success of biomedical domain-specific pre-training of
large transformer language models [6] led us to test two models of the same architecture
that were pre-trained on different corpora [8] and to evaluate their performance on the NER task. For
the DISTEMIST-linking sub-task we propose a multi-step approach that combines embedding
similarity calculation and term matching.


2. Data
Both DISTEMIST subtasks challenge researchers with real-world datasets, promoting the
improvement of state-of-the-art NLP systems for clinical coding [9]. The gold-standard corpus
provided by the workshop organizers is a collection of 1,000 clinical cases in Spanish
from different medical specialities, such as cardiology, oncology, otorhinolaryngology,
dentistry, pediatrics, primary care, allergology, radiology, psychiatry, ophthalmology and urology,
annotated with disease mentions [10].
   DISTEMIST organizers provided a collection of 750 clinical cases for the NER sub-track,
583 of which formed the training set for the NEN sub-track. This training set was annotated
   1 https://www.snomed.org
   2 https://www.snomed.org/snomed-ct/five-step-briefing
Figure 1: Descriptions of the 20 most frequent entities and Snomed-CT codes in the corpus (English
translation of entities and code descriptions made only to ease the reading).


                                            Entities   Tokens   Sentences
                                    max        57       1,486      132
                                    min         1         98        6
                                    avg       10.75      457      33.96
                                    SD        6.21     218.28     16.39

Table 1
Corpus statistics.


with 5,348 unique entity mentions and 1,844 unique Snomed-CT codes; the maximum
number of disease annotations per text is 57. Figure 1 shows descriptions of the 20 most frequently
mentioned entities and Snomed-CT codes in the DISTEMIST corpus.
   One peculiarity of the provided annotations is the existence of nested disease mentions,
that is, complex expressions like "loss of kidney graft from chronic nephropathy",
a disorder mention that contains another one, namely "chronic nephropathy". In the
DISTEMIST corpus such entities appear as separate annotations, and the total count of these
mentions is 411. During pre-processing, we resolve nested disorder mentions in favour of
the longest one.
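The resolution of nested mentions in favour of the longest one can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes annotations are available as (start, end) character offsets, and the function name `resolve_nested` is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def resolve_nested(spans: List[Span]) -> List[Span]:
    """Keep only the spans that are not fully contained in a longer span."""
    # Visit longer spans first so that containers are kept before their nested spans.
    ordered = sorted(set(spans), key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end in ordered:
        contained = any(k_start <= start and end <= k_end for k_start, k_end in kept)
        if not contained:
            kept.append((start, end))
    return sorted(kept)


text = "loss of kidney graft from chronic nephropathy"
outer = (0, len(text))                          # the complex mention
inner = (text.index("chronic"), len(text))      # the nested "chronic nephropathy"
print(resolve_nested([inner, outer]))           # only the outer span survives
```

Partially overlapping spans that do not contain one another are left untouched in this sketch, since the source only describes the nested case.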
   Text lengths obtained by tokenizing each text with the RoBERTa byte-level
Byte-Pair-Encoding tokenizer [11] showed that the longest text contained 1,486 tokens and, most
importantly, that 248 texts from the training set exceeded the maximum input length of the
RoBERTa model selected as the core of our system, which is set to 512 tokens. This led us to
perform sentence-level NER, so text pre-processing consisted of splitting the texts into sentences
using the SentenceRecognizer from the SpaCy processing pipeline3. SpaCy’s SentenceRecognizer
relies on the es_core_news_sm pre-trained language model4, which was used to predict whether
each token of every text starts a sentence or not. Some basic statistics of the dataset are
summarized in Table 1.
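Moving from document-level to sentence-level NER implies that document-level annotation offsets have to be expressed relative to each sentence. The source does not describe this step, so the following is only a sketch under stated assumptions: sentence boundaries are given as character offsets (they could, for instance, be read from the `start_char` and `end_char` attributes of spaCy sentence spans), the helper `rebase_to_sentences` is hypothetical, and the example sentence is made up.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed format)


def rebase_to_sentences(sentences: List[Span], entities: List[Span]) -> Dict[int, List[Span]]:
    """Assign each entity to the sentence that contains it and re-base its offsets."""
    per_sentence: Dict[int, List[Span]] = {i: [] for i in range(len(sentences))}
    for ent_start, ent_end in entities:
        for i, (sent_start, sent_end) in enumerate(sentences):
            if sent_start <= ent_start and ent_end <= sent_end:
                per_sentence[i].append((ent_start - sent_start, ent_end - sent_start))
                break
        # Entities crossing a predicted sentence boundary are dropped in this sketch.
    return per_sentence


# Made-up example text with two sentences and one entity in each.
text = "Paciente con fiebre. Se diagnostica neumonía."
sentences = [(0, 20), (21, len(text))]
entities = [
    (text.index("fiebre"), text.index("fiebre") + len("fiebre")),
    (text.index("neumonía"), text.index("neumonía") + len("neumonía")),
]
print(rebase_to_sentences(sentences, entities))
```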
   It is important to mention that we randomly split the training set in order to perform
in-house validation during system development. The resulting validation set
    3 https://spacy.io/api/sentencerecognizer
    4 https://spacy.io/models/es
contained 30% of training data.
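A random 70/30 split of this kind can be sketched as below. The source does not state whether the split was made over documents or sentences, nor which seed was used, so the document-level split, the seed and the identifiers are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_train_validation(
    doc_ids: Sequence[str], validation_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the identifiers and hold out a fraction of them for validation."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (seed is an assumption)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical identifiers for the 750 annotated clinical cases.
train_ids, validation_ids = split_train_validation([f"case_{i}" for i in range(750)])
print(len(train_ids), len(validation_ids))  # 525 225
```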
   The test set consisted of 250 additional cases for both sub-tracks. Predictions
were made on a concatenation of the test and background sets, 3,000
documents in total, which we also subjected to the same pre-processing as the training set.


3. System Description
In this section, we describe the systems developed for DISTEMIST-entities and DISTEMIST-
linking sub-tasks.

3.1. Sub-task 1
The DISTEMIST-entities subtrack required automatically finding disease mentions in clinical
cases. Taking into account the length of clinical texts in the dataset, as we anticipated in Section
2, we opted for a sentence-level NER approach based on fine-tuning of two pre-trained RoBERTa
language models [11].
   Our first system is based on a fine-tuned biomedical-clinical model5, trained on a combination
of biomedical and clinical texts, which is hypothetically better suited to the proposed task because of
the particular syntax and vocabulary of clinical texts compared to medical scientific
literature.
   The second system developed for the NER subtask leveraged a medical domain-specific model6
pre-trained on the medical crawler collection [12] extended with data from other sources,
such as SciELO-Spain-Crawler [13].
   The two models were fine-tuned for the token classification task by adding a linear classifier
layer, preceded by a dropout layer with a rate of 0.1, on top of the original architecture.
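One way to write this classification head down is given below. The symbols are assumptions introduced for illustration: $h_t \in \mathbb{R}^d$ is the encoder's last hidden state for token $t$, $W \in \mathbb{R}^{K \times d}$ and $b \in \mathbb{R}^K$ are the parameters of the linear layer, and $K$ is the number of token labels, which the source does not specify.

```latex
\tilde{h}_t = \mathrm{Dropout}_{p=0.1}(h_t), \qquad
z_t = W \tilde{h}_t + b, \qquad
\hat{y}_t = \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} z_{t,k}
```

Dropout is active only during fine-tuning; at prediction time $\tilde{h}_t = h_t$.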

3.2. Sub-task 2
This task aims to assign each mention found in the DISTEMIST-entities track a code from the
list of relevant Snomed-CT terms provided by the competition organizers [14]. To address
this, we propose a three-step approach. First, we calculate embeddings for all the entity
spans detected in subtask 1 and for every term in the ontology with the RoBERTa models that we
fine-tuned for the previous subtask. We obtain these embeddings by mean pooling the last hidden state of
the model’s output. We then link each entity to the closest term in the ontology by calculating
the cosine similarity between the resulting embeddings. The second step of our approach
looks for exact matches between the mentions found and the ontology terms. In this
phase, the system replaces the Snomed-CT code assigned in the previous step if the mention
string is exactly the same as any ontology term. This step matched 14,429 entities, 2,618 of which
are unique. Lastly, we follow the same approach, but this time we look for
exact matches in the training set provided by the organizers, where 6,246 additional entities
(633 unique) are found. Therefore, exact string matching finds 20,675 of the 48,699 entities
detected in the previous subtask. Figure 2 illustrates the architecture of the proposed system.
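The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: embeddings are represented as plain lists of floats instead of RoBERTa hidden states, the codes are placeholders and not real Snomed-CT identifiers, all function and variable names are hypothetical, and the training-set lookup is applied only to mentions with no exact ontology match (an assumption based on the source counting those matches as additional).

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean_pool(token_vectors: Sequence[Vector]) -> List[float]:
    """Average per-token vectors (e.g. last hidden states) into one span embedding."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(mention: str, mention_vec: Vector, term_vecs: Dict[str, Vector],
                 term_codes: Dict[str, str], train_codes: Dict[str, str]) -> str:
    # Step 1: assign the code of the ontology term with the highest cosine similarity.
    nearest_term = max(term_vecs, key=lambda term: cosine(mention_vec, term_vecs[term]))
    code = term_codes[nearest_term]
    # Step 2: an exact string match with an ontology term replaces that code.
    if mention in term_codes:
        code = term_codes[mention]
    # Step 3: otherwise, an exact match with a training annotation replaces it.
    elif mention in train_codes:
        code = train_codes[mention]
    return code


# Toy data with placeholder codes.
term_codes = {"neumonía": "CODE_A", "insuficiencia renal aguda": "CODE_B"}
term_vecs = {"neumonía": [1.0, 0.0], "insuficiencia renal aguda": [0.0, 1.0]}
train_codes = {"fallo renal": "CODE_B"}

unseen_vec = mean_pool([[1.0, 0.0], [0.8, 0.2]])
print(link_mention("pulmonía", unseen_vec, term_vecs, term_codes, train_codes))
```

The counts reported above are mutually consistent: 14,429 + 6,246 = 20,675, which is roughly 42% of the 48,699 detected entities, so the remaining mentions rely on the embedding similarity of the first step alone.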
   5 https://huggingface.co/PlanTL-GOB-ES/roberta-base-biomedical-clinical-es
   6 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-es
Figure 2: Named entity normalization system architecture.


4. Experimental Setup
All the transformer models were fine-tuned on a single NVIDIA Ampere A100 GPU using
the Hugging Face transformers Python library [15].
   In order to maximize the performance of our systems, we carried out
hyperparameter optimization with the Optuna framework [16]. This framework incorporates
efficient implementations of both search and pruning strategies. During the optimization,
Optuna infers concurrence relations between the searched parameters in order to switch from independent
sampling to relational sampling after a few trials. In addition, a pruning algorithm monitors
intermediate training results and terminates unpromising trials.
   The hyperparameter space for the optimization was defined as follows:

    • Learning rate: float value between 3e-5 and 5e-5
    • Number of training epochs: integer value between 3 and 10
    • Training batch size: 8, 16 or 32
    • Weight decay: float value between 1e-12 and 1e-1
    • AdamW optimizer epsilon: float value between 1e-10 and 1e-6
    • Warmup steps: integer value between 0 and 1000
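The search space listed above could be expressed with Optuna's trial API roughly as follows. This is a sketch, not the authors' script: the function name is hypothetical, the dictionary keys follow Hugging Face training-argument names by assumption, and sampling weight decay and epsilon on a log scale (`log=True`) is an assumption motivated by ranges that span several orders of magnitude.

```python
from typing import Any, Dict


def suggest_hyperparameters(trial: Any) -> Dict[str, Any]:
    """Sample one configuration; `trial` is expected to behave like an optuna Trial."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 3e-5, 5e-5),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 3, 10),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        # log=True is an assumption; the source gives only the bounds.
        "weight_decay": trial.suggest_float("weight_decay", 1e-12, 1e-1, log=True),
        "adam_epsilon": trial.suggest_float("adam_epsilon", 1e-10, 1e-6, log=True),
        "warmup_steps": trial.suggest_int("warmup_steps", 0, 1000),
    }
```

With Optuna installed, such a function would typically be called inside an objective that fine-tunes the model and returns the validation score, and the objective would then be passed to `study.optimize` on a study created with `optuna.create_study(direction="maximize")`. The number of trials and the pruner used are not given in the source.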

   Finally, Table 2 summarizes the hyperparameters selected for each model after the optimization
trials.


5. Results
The metrics selected by the DISTEMIST organizers to evaluate system performance on both
tracks are micro-average precision (MiP), micro-average recall (MiR) and micro-average
F1-score (MiF1), which are commonly used for text and token classification tasks. Table 3
                                    RoBERTa biomedical       RoBERTa clinical
                 Learning rate               4e-5                   5e-5
                Training epochs               10                     10
                   Batch size                  8                     16
                 Weight decay                1e-6                   3e-6
                AdamW epsilon               2.6e-9                  1e-8
                 Warmup steps                 73                    440

Table 2
Hyperparameters selected for each model.


                   Subtask                 System          MiP      MiR     MiF1
             DISTEMIST-entities     RoBERTa biomedical    0.7520   0.7259   0.7387
                                     RoBERTa clinical     0.7519   0.7221   0.7367
                                         MEAN             0.6502   0.6079   0.6221
                                           SD             0.1633   0.1475   0.1585
             DISTEMIST-linking      RoBERTa biomedical    0.4134   0.4069   0.4101
                                     RoBERTa clinical     0.4163   0.4081   0.4122
                                         MEAN             0.3965    0.335   0.3588
                                           SD             0.1381   0.1202    0.127

Table 3
Official results obtained by the SINAI team in DISTEMIST-entities and DISTEMIST-linking subtasks
along with the mean (MEAN) and standard deviation (SD) of all participants’ submissions.


summarises the results obtained by the SINAI team during the official evaluation carried out by
the organizers.
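For reference, micro-averaged precision, recall and F1 can be computed by pooling annotations over all documents before counting, as sketched below. The exact matching criterion of the official evaluation script is not described here, so exact match on (document, start, end) triples is an assumption, and the toy annotations are made up.

```python
from typing import Set, Tuple

Annotation = Tuple[str, int, int]  # (document id, start offset, end offset), assumed format


def micro_prf(gold: Set[Annotation], predicted: Set[Annotation]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over annotations pooled across documents."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up toy annotations: two of the four predictions match the three gold mentions.
gold = {("d1", 0, 5), ("d1", 10, 20), ("d2", 3, 9)}
predicted = {("d1", 0, 5), ("d1", 11, 20), ("d2", 3, 9), ("d2", 15, 18)}
print(micro_prf(gold, predicted))

# F1 is the harmonic mean of precision and recall; the MiP and MiR reported in
# Table 3 for the biomedical NER system reproduce the reported MiF1.
mip, mir = 0.7520, 0.7259
print(round(2 * mip * mir / (mip + mir), 4))  # 0.7387
```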
   The evaluation showed that the pairs of systems submitted to both sub-tracks achieve
very similar results despite being based on two different pre-trained models. Using
the biomedical model on EHRs can be considered a cross-domain experiment, and the fact that
our biomedical system exhibits encouraging results (0.7387 MiF1) on the NER task highlights
the potential for domain transfer between the biomedical and clinical fields. The clinical
model also performed well on the first sub-track, scoring 0.7367 MiF1 on the test set.
   Regarding the results obtained in the second subtask, our best system achieved a MiP of
0.4163, a MiR of 0.4081, and a MiF1 of 0.4122, all of them being higher than the average scores
of all the participants. It is important to note that these results are highly dependent on the
ones scored in the NER subtrack since the entities used to assign the normalized codes are the
ones detected in that task.

5.1. Error analysis
In order to identify pockets of low performance of our NER system,
we conducted an error analysis based on the model’s performance on the custom validation set, which
consisted of a random 30% split of the DISTEMIST corpus, as stated in Section 2. The most frequently
observed error type is related to nested entities. The system occasionally either detects a
               Entity span                                                           Detected
               insuficiencia renal aguda                                                ✓
                 (eng.: acute renal failure)
               insuficiencia renal aguda secundaria a administración de aciclovir       ✗
                 (eng.: acute renal failure secondary to acyclovir administration)

Table 4
Example 1 of incorrect labelling of a nested entity


               Entity span                                                           Detected
               epilepsia rolándica izquierda secundaria a cisticercosis sistémica       ✗
                 (eng.: left rolandic epilepsy secondary to systemic cysticercosis)
               cisticercosis sistémica                                                  ✓
                 (eng.: systemic cysticercosis)

Table 5
Example 2 of incorrect labelling of a nested entity


complex mention and, consequently, is not able to recognize one that forms part of it, as shown
in Table 4, or detects only the simple mention without returning the nested one, as Table 5
illustrates.


6. Conclusions and future work
In this paper, we present the systems developed by the SINAI team for the DISTEMIST track at the CLEF
2022 BioASQ workshop. This challenge consisted of two sub-tracks: one focused on the detection
of disease mentions in EHRs and the other on mapping these mentions to codes from the
Snomed-CT ontology.
   In order to address these two tasks, our team followed an approach based on
fine-tuning two transformer models pre-trained on biomedical and biomedical-clinical corpora.
For the DISTEMIST-entities sub-track we fine-tuned both models to perform sentence-level
NER with a prior hyperparameter optimization step. For the DISTEMIST-linking sub-track we
applied several techniques to find the closest term in the Snomed-CT ontology in order to assign
a code to each entity.
   The resulting performance of our NER systems revealed the remarkable cross-domain
potential that the selected transformer-based model pre-trained on biomedical corpora has when
fine-tuned on clinical texts. Our best performing NER system has also been made publicly available
on the HuggingFace Hub7. As for entity linking, calculating embedding distances provided
encouraging results for entities that appeared neither in the ontology nor in the training
dataset.
   For future work, we plan to address the nested entities issue by testing novel approaches such as

    7 https://huggingface.co/chizhikchi/Spanish_disease_finder
Parallel Instance Query Networks (PIQN) [17]. Furthermore, a more in-depth error analysis
needs to be carried out in order to identify the causes of false positives and false negatives in
the test set predictions and to suggest solutions for these problems. With the aim
of improving the performance of the entity linking system, we plan to improve both the matching and
the semantic similarity calculation. Bearing in mind that abbreviations and acronyms are commonly
used in medical texts [4], we are considering including disambiguation of abbreviated terms as
a step prior to matching in our processing pipeline. Furthermore, we aim to run
experiments using Levenshtein distance as an indicator of similarity between text
sequences.


7. Acknowledgements
This work has been partially supported by the Big Hug project (P20_00956, PAIDI 2020) and the WeLee
project (1380939, FEDER Andalucía 2014-2020), funded by the Andalusian Regional Government, and by the
LIVING-LANG project (RTI2018-094653-B-C21), funded by MCIN/AEI/10.13039/501100011033.


References
 [1] A. Miranda-Escalada, A. Gonzalez-Agirre, J. Armengol-Estapé, M. Krallinger, Overview of
     automatic clinical coding: Annotations, guidelines, and solutions for non-English clinical
     cases at CodiEsp track of CLEF eHealth 2020, in: CLEF (Working Notes), 2020.
 [2] F. Catling, G. P. Spithourakis, S. Riedel, Towards automated clinical coding, International
     Journal of Medical Informatics 120 (2018) 50–61. URL:
     https://www.sciencedirect.com/science/article/pii/S1386505618304039.
     doi:10.1016/j.ijmedinf.2018.09.021.
 [3] B. Al-Hablani, The use of automated SNOMED CT clinical coding in clinical decision support
     systems for preventive care, Perspectives in Health Information Management 14 (2017).
 [4] R. Islamaj Dogan, G. C. Murray, A. Névéol, Z. Lu, Understanding PubMed® user search
     behavior through log analysis, Database 2009 (2009). URL:
     https://doi.org/10.1093/database/bap018. doi:10.1093/database/bap018.
 [5] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional
     transformers for language understanding, CoRR abs/1810.04805 (2018). URL:
     http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
 [6] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained
     biomedical language representation model for biomedical text mining, Bioinformatics
     (2019). URL: https://doi.org/10.1093/bioinformatics/btz682.
     doi:10.1093/bioinformatics/btz682.
 [7] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon,
     Domain-specific language model pretraining for biomedical natural language processing,
     ACM Transactions on Computing for Healthcare 3 (2022) 1–23. URL:
     https://doi.org/10.1145/3458754. doi:10.1145/3458754.
 [8] C. P. Carrino, J. Armengol-Estapé, A. Gutiérrez-Fandiño, J. Llop-Palao, M. Pàmies,
     A. Gonzalez-Agirre, M. Villegas, Biomedical and clinical language models for Spanish:
     On the benefits of domain-specific pretraining in a mid-resource scenario, 2021.
     arXiv:2109.03570.
 [9] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis,
     A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ:
     Automatic detection and normalization of diseases from clinical texts: results, methods,
     evaluation and multilingual resources, in: Working Notes of Conference and Labs of the
     Evaluation (CLEF) Forum. CEUR Workshop Proceedings, 2022.
[10] A. Miranda-Escalada, E. Farré, L. Gasco, S. Lima, M. Krallinger, DisTEMIST corpus:
     detection and normalization of disease mentions in Spanish clinical cases, 2022. URL:
     https://doi.org/10.5281/zenodo.6532684. doi:10.5281/zenodo.6532684, Funded by the
     Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
[11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv:1907.11692
     [cs] (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[12] C. P. Carrino, J. Armengol-Estapé, O. de Gibert Bonet, A. Gutiérrez-Fandiño,
     A. Gonzalez-Agirre, M. Krallinger, M. Villegas, Spanish biomedical crawled corpus: A large,
     diverse dataset for Spanish biomedical language models, 2021. arXiv:2109.07765.
[13] A. Intxaurrondo, SciELO-Spain-Crawler, 2019. URL: https://doi.org/10.5281/zenodo.2541681.
     doi:10.5281/zenodo.2541681, Funded by the Plan de Impulso de las Tecnologías del
     Lenguaje (Plan TL), the ICTUSnet INTERREG Sudoe project and "La MARATO de TV3".
[14] L. Gascó, M. Krallinger, DisTEMIST gazetteer, 2022. URL:
     https://doi.org/10.5281/zenodo.6505583. doi:10.5281/zenodo.6505583, Funded by the
     Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).
[15] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Huggingface’s transformers: State-of-the-art
     natural language processing, 2019. URL: https://arxiv.org/abs/1910.03771.
     doi:10.48550/ARXIV.1910.03771.
[16] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation
     hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining, 2019.
[17] Y. Shen, X. Wang, Z. Tan, G. Xu, P. Xie, F. Huang, W. Lu, Y. Zhuang, Parallel instance
     query network for named entity recognition, arXiv preprint arXiv:2203.10545 (2022).