=Paper=
{{Paper
|id=Vol-3180/paper-15
|storemode=property
|title=HPI-DHC @ BioASQ DisTEMIST: Spanish Biomedical Entity Linking with Pre-trained Transformers and Cross-lingual Candidate Retrieval
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-15.pdf
|volume=Vol-3180
|authors=Florian Borchert,Matthieu-P. Schapranow
|dblpUrl=https://dblp.org/rec/conf/clef/BorchertS22
}}
==HPI-DHC @ BioASQ DisTEMIST: Spanish Biomedical Entity Linking with Pre-trained Transformers and Cross-lingual Candidate Retrieval==
Florian Borchert, Matthieu-P. Schapranow
Digital Health Center, Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany

Abstract

Biomedical named entity recognition and entity linking are important building blocks for various clinical applications and downstream NLP tasks. In the clinical domain, language resources for developing entity linking solutions are scarce: only a few datasets have been annotated on the level of concepts, and the majority of concept aliases in target ontologies are only available in English. In such a resource-constrained setting, pre-training and cross-lingual transfer are promising approaches to improve the performance of entity linking systems. In this paper, we describe our contribution to the BioASQ DisTEMIST shared task. The goal of the task is to extract disease mentions from Spanish clinical case reports and map them to concepts in SNOMED CT. Our system comprises a Transformer-based named entity recognition model, a hybrid candidate generation approach, and a rule-based reranking step. For candidate generation, we employ an ensemble of 1) a TF-IDF vectorizer based on character n-grams and 2) a cross-lingual SapBERT model. Our best run for the entity linking subtrack achieves a micro-averaged F1 score of 0.566, which is the best score across all submissions in this track. A detailed analysis of system performance highlights the importance of task-specific entity ranking and the benefits of cross-lingual candidate retrieval.

Keywords

Spanish, Case Reports, Biomedical, Cross-lingual, Entity Linking, Ensemble, SapBERT, SNOMED CT

1. Introduction

Extraction of structured metadata from text documents through named entity recognition (NER) and entity linking (EL) is the basis for many downstream NLP components and applications, such as relationship extraction, semantic indexing, or information retrieval. Particularly rich ontologies such as SNOMED CT have been developed to model the clinical domain [1], where they enable semantic interoperability between software systems and support a variety of clinical applications [2, 3]. While the richness of clinical ontologies opens up the potential for fine-grained semantic annotation of free-text documents, it poses challenges for systems that aim to perform this annotation automatically. Choosing the correct mapping from textual mentions to one or more concepts in an ontology is highly context-specific, therefore often ambiguous, and can depend on the respective application domain in subtle ways. Moreover, EL is inherently a low-resourced task along multiple dimensions.
Firstly, even the largest annotated corpora cover only a very small subset of the concepts in common biomedical terminologies, such as the Unified Medical Language System (UMLS) [4, 5]. Secondly, the vast majority of terms in these ontologies are only available in English, e.g., around 70% of terms in the UMLS Metathesaurus come from the English language, around 11% from Spanish, and less than 3% from each of the other included languages.¹ For the SNOMED CT ontology, which is part of the UMLS, translation into different national languages is an ongoing effort that has to date not been completed for many countries, such as Germany. Corpora annotated on the level of entity mentions and their mapping to concepts for languages other than English are therefore scarce and immensely valuable to drive progress in the field of biomedical EL [6, 7, 8, 9, 10].

1.1. Related Work

Throughout this work, we follow the terminology and general architecture for EL systems proposed by Sevgili et al. [11] and distinguish components for Mention Detection, Candidate Generation, and Entity Ranking. In their work, the authors particularly review neural approaches for EL, which have recently received increased attention from the research community. Nevertheless, many tools used by practitioners are based on rule-based and non-neural statistical approaches, which still provide competitive baselines on benchmark datasets [12, 13, 14, 15, 16]. While neural systems with dense entity retrieval have been proposed [17, 18], other systems are hybrid in the sense that they include non-neural components for candidate generation, often based on TF-IDF scores (or variants thereof) calculated from surface forms of mentions and concept aliases [19, 20, 21, 22, 23]. Other neural components can improve existing non-neural EL systems, e.g., by filtering candidate sets based on semantic type prediction [24].

1.2. Contribution and Outline

In this work, we describe our contribution to the DisTEMIST shared task. The goal of the task is to extract disease mentions (subtrack 1) from Spanish-language clinical case reports and link them to SNOMED CT codes (subtrack 2). We propose a hybrid EL system, outlined in Figure 1, which makes use of:
• a standard Transformer-based NER pipeline for mention detection
• an ensemble of two complementary candidate generation approaches
• a rule-based reranker that is specifically adapted to the DisTEMIST datasets

The remainder of this work is structured as follows: in section 2, we provide an overview of the used datasets and generated dictionaries. In section 3, we share a detailed description of the components in our system and the methods used to adapt its parameters to annotated datasets. In section 4, we describe the results of our approach in the context of the DisTEMIST shared task. Our findings, limitations, and potential improvements are discussed in section 5, followed by a conclusion and outlook in section 6.

¹ https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/statistics.html
Figure 1: Overview of our entity linking system. As a first step, disease mentions are extracted by a Transformer-based NER pipeline, the output of which are the predictions for DisTEMIST subtrack 1. Two candidate generators are used to retrieve candidate concepts for these mentions from two different dictionaries. The candidate lists are combined using individual weights for their predictions, filtered by a confidence threshold, and reranked to produce the final output for subtrack 2.

2. Materials

In this section, we describe the datasets provided in the DisTEMIST shared task and the dictionaries we use for retrieving concepts in our EL system.

2.1. DisTEMIST Datasets

The dataset available for training consists of 750 Spanish-language clinical case reports from a variety of medical specialties. All of these cases have been manually annotated by medical experts with mentions of diseases. For 584 documents, these mentions have also been annotated with concept IDs from a subset of SNOMED CT. 250 additional documents were annotated in the same manner and served as a held-out test set for the evaluation of participating systems. A detailed overview of the corpus is given by Miranda-Escalada et al. [25]. Furthermore, translations of the data and annotations into six different languages have been provided, although we have not made use of these multilingual resources in the development of our system.

2.2. Dictionaries

For the DisTEMIST shared task, a dictionary (gazetteer) of relevant concepts and Spanish terms is provided, containing 111,179 distinct concepts with 147,280 aliases. As the number of aliases is relatively low compared to the number of concepts (most concepts only have a single alias), we assume that many surface forms of concepts are not covered by the gazetteer. Therefore, we extend the set of available synonyms by means of the UMLS Metathesaurus (release 2021AB), mapping SNOMED CT concepts to other terminologies, including the US version of SNOMED CT. We retain only those concepts in the UMLS Metathesaurus that are present in the DisTEMIST gazetteer, as only these are considered for evaluation in the shared task. Thus, we obtain a multilingual dictionary and extend the set of available synonyms more than 16-fold to 2,416,514 aliases in a variety of languages beyond Spanish (the vast majority being English).

3. Methods

In this section, we describe our EL system and the procedures for choosing its (hyper-)parameters.

3.1. Mention Detection

For recognizing mentions of diseases, we employ an NER approach implemented with Hugging Face Transformers, consisting of a BERT-based encoder followed by a token classification head [26]. The encoder has been initialized with weights from the PlanTL-GOB-ES/roberta-base-biomedical-clinical-es checkpoint, which was obtained by pre-training a RoBERTa model on a large unlabeled corpus of Spanish biomedical-clinical documents [27, 28].
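A minimal sketch of how such a token classification model can be instantiated with Hugging Face Transformers is shown below. It is illustrative rather than the exact Hydra-based training setup; in particular, the IOB label names are assumptions chosen for the example.

```python
# Sketch of a Transformer-based NER tagger for disease mentions
# (illustrative only; the IOB label names are assumed for this example).
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

CHECKPOINT = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
LABELS = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]  # assumed single-type IOB labels

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# After fine-tuning the classification head on the IOB-encoded DisTEMIST
# sentences, subword predictions can be aggregated into entity spans with
# character offsets, matching the offset-based submission format:
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(ner("Paciente con anemia e hipocalcemia tras la intervención."))
```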
The overall experimental setup is based on the Hydra framework and has been adapted from our earlier work on German-language clinical NER in the context of the GGPONC project [29, 30].

3.1.1. Pre- and Post-Processing

To use a token-based NER implementation with the DisTEMIST datasets, we split the documents into sentences and tokens using the general-domain spaCy model es_core_news_md [31]. Sentence splitting is necessary, as the NER model's input layer size is fixed to a constant number of subword tokens. We align the entity offsets given in the training dataset to individual tokens and convert them into IOB-encoded class labels. This procedure is performed in reverse for producing the offset-based submission format of the DisTEMIST shared task.

As the NER pipeline systematically produces artifacts concerning entity boundaries on the test set, we apply some simple cleanup steps. A substantial number of detected entities contain line breaks (mostly at the end), which do not occur in the training set and hurt entity linking performance. We therefore adjust the boundaries of such entities, retaining all characters up until the first line break. Moreover, when entities end with one or more non-word characters (regular expression \W), these characters are also cropped.

3.1.2. Training and Model Selection

For model selection and estimation of generalization performance, we split the dataset into a training and a validation set on the level of documents. For the validation set, we sample 117 documents (20%) from the 584 documents available in DisTEMIST subtrack 2. This validation set is used for evaluating our models for both subtracks and is not used during training or model selection. The remaining 633 documents available for subtrack 1 are first split into sentences, which are then randomly assigned to the training set (10,069 sentences) and the development set (1,727 sentences).

The NER model is trained for 100 epochs on the training set with a single Nvidia A40 GPU (48 GB RAM). We keep the checkpoint that achieves the highest F1 score on the development set. Using Hydra, we perform a grid search over the following hyperparameters: learning rate, learning rate schedule, warmup ratio, label smoothing factor, and weight decay. The hyperparameter search was carried out on a machine with six A40 GPUs, 128 AMD EPYC 7543 CPU cores, and 2 TB of main memory. The optimal hyperparameters are available as Hydra configurations together with the project's source code [32].

3.2. Entity Linking

The mentions detected as described in the previous section are independently linked to potential SNOMED CT concepts. To this end, we employ two different candidate generation approaches and combine their results in an ensemble. Subsequently, the scores in the ranked list of candidate concepts are adjusted based on a number of rules, resulting in the final reordered candidate list. From this list, only the top result is considered for submission in the DisTEMIST shared task.

3.2.1. Candidate Generation

TF-IDF with character n-grams: Our first candidate generation approach is based on the implementation from scispaCy [16]. We have converted the (mono-lingual) DisTEMIST gazetteer to the required format to rebuild the indices used by the candidate generator. Concepts and aliases are encoded as TF-IDF vectors calculated over character 3-grams. At prediction time, the same encoding is applied to mentions, and an approximate nearest neighbor search over concepts is applied to generate a ranked candidate list.
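The gist of this retrieval scheme can be sketched with scikit-learn as follows. This is a simplified illustration, not scispaCy's implementation: it uses exact instead of approximate nearest neighbor search, and the alias/code pairs are toy values (the code for "anemia" follows the example in Figure 1; the others are placeholders).

```python
# Minimal sketch of TF-IDF candidate generation over character 3-grams,
# in the spirit of scispaCy's candidate generator (exact k-NN for brevity;
# toy dictionary entries for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# (alias, SNOMED CT code) pairs as they would come from the DisTEMIST gazetteer
aliases = [
    ("anemia", "271737000"),
    ("anemia ferropénica", "87522002"),   # placeholder entry
    ("hipocalcemia", "5291005"),          # placeholder entry
]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
alias_vectors = vectorizer.fit_transform([alias for alias, _ in aliases])
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(alias_vectors)

def candidates(mention: str):
    """Return (code, score) candidates, ranked by cosine similarity."""
    distances, indices = index.kneighbors(vectorizer.transform([mention]))
    return [(aliases[i][1], 1.0 - d) for i, d in zip(indices[0], distances[0])]

print(candidates("anemias"))  # character n-grams tolerate small surface variation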
We refer the reader to Neumann et al. [16] for further details. Note that some improvements implemented in scispaCy, such as abbreviation expansion or filtering based on available definitions in the UMLS, were not applicable here.

Cross-lingual SapBERT: To leverage the large set of multilingual concept aliases available through the UMLS, we use the cross-lingual version of SapBERT to obtain representations of mentions and candidate concepts from the multilingual UMLS-based dictionary in the same embedding space [33]. SapBERT weights are obtained by a technique called self-alignment pre-training (SAP), which fine-tunes BERT on synonyms from the UMLS. We apply a simple nearest neighbor search over these embeddings for candidate generation, using the normalized dot product as a similarity measure. For our experiments, we initialize the encoder from the checkpoint cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR, available through the Hugging Face Hub. Again, we refer the reader to Liu et al. [33] for details on the pre-training method.

Ensemble: To leverage the individual strengths of each candidate generator, we combine their predictions using the following approach: for each candidate generator, a threshold in the interval [0.0, 0.1] is used to filter out concepts with scores below this threshold. Moreover, each candidate generator is assigned a weight in the range [0.0, 0.1], which is multiplied with each score. The resulting candidate lists are merged and sorted by the weighted score. Thresholds and weights are hyperparameters that need to be chosen by the user or can be derived automatically as described in subsubsection 3.2.3. Note that the number of generators is not fixed to two, and the same approach is applicable in the presence of more candidate generators.

3.2.2. Entity Ranking

The candidate rankings resulting from the aforementioned steps are based on generic approaches with limited possibilities to adapt the ranking with respect to human-labeled data and their specific annotation policies. We therefore implement a set of rules to reorder the candidate lists (a simplified sketch combining the ensemble and these rules is given at the end of this subsection):

• Semantic types: We define weights w_disorder, w_finding, and w_morphologic_abnormality in the range [0.0, 1.5] for the semantic types (according to the DisTEMIST gazetteer) covering the vast majority of concepts in the DisTEMIST training data. The score of each concept belonging to one of these semantic types is multiplied by the respective weight.
• Agreement between candidate generators: If the same concept occurs z times in the combined candidate list resulting from the ensemble, each of its scores is multiplied by a factor of 1 + β · z, with β taking values in the range [0.0, 1.0].
• Suppression status: For each concept, we subtract the fraction of suppressed terms for this concept according to the UMLS, multiplied by a factor w_suppression in the range [−1.5, 1.5].
• Preferred term status: For each mention that matches the canonical name or preferred term in the UMLS or DisTEMIST gazetteer, we apply a factor w_preferred in the range [0.0, 1.5].

These rules are tailored to the DisTEMIST challenge, but can be easily extended or adapted for other datasets. All rules make use of hyperparameters (w, β), which we choose as described in subsubsection 3.2.3.

Training Set Lookup: As a final post-processing step, we determine whether a mention is exactly identical to one of the mentions in the DisTEMIST training data. If this is the case, the concept annotated in the training set is placed at the first position in the candidate list. Conceptually, we thereby introduce another (exact) dictionary lookup with precedence over all other candidate generators.
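The following sketch illustrates, under simplified assumptions, how the weighted ensemble, a subset of the reranking rules (semantic types, generator agreement, preferred terms; the suppression rule is omitted), and the final training set lookup interact. All hyperparameter values and candidate lists are made up for the example; the code "271737000" for "anemia" follows the Figure 1 example, the other code is a placeholder.

```python
# Illustrative sketch of the weighted candidate ensemble with rule-based
# reranking and training set lookup (hyperparameter values are assumed).
from collections import defaultdict

GEN_WEIGHTS = {"tfidf": 0.08, "sapbert": 0.06}     # per-generator weights
GEN_THRESHOLDS = {"tfidf": 0.05, "sapbert": 0.05}  # per-generator thresholds
TYPE_WEIGHTS = {"disorder": 1.2, "finding": 0.9,
                "morphologic abnormality": 0.8}    # semantic type rule
BETA = 0.3                                         # generator agreement rule
W_PREFERRED = 1.2                                  # preferred term rule

def rank(candidate_lists):
    """candidate_lists: {generator: [(code, score, semantic_type, is_preferred)]}."""
    counts = defaultdict(int)  # occurrences of each code in the merged list
    merged = []
    for gen, cands in candidate_lists.items():
        for code, score, sem_type, preferred in cands:
            if score < GEN_THRESHOLDS[gen]:  # drop low-confidence candidates
                continue
            counts[code] += 1
            merged.append([code, GEN_WEIGHTS[gen] * score, sem_type, preferred])
    for cand in merged:
        cand[1] *= TYPE_WEIGHTS.get(cand[2], 1.0)  # semantic type weight
        cand[1] *= 1 + BETA * counts[cand[0]]      # boost generator agreement
        if cand[3]:
            cand[1] *= W_PREFERRED                 # preferred term bonus
    return sorted(merged, key=lambda c: c[1], reverse=True)

ranked = rank({
    "tfidf":   [("271737000", 0.95, "disorder", True)],                  # anemia
    "sapbert": [("271737000", 0.90, "disorder", True),
                ("123037004", 0.85, "morphologic abnormality", False)],  # toy code
})
# The exact training set lookup finally overrides the top candidate
# (toy mention/code mapping standing in for the training annotations):
training_lookup = {"anemia": "271737000"}
print(training_lookup.get("anemia", ranked[0][0]))
```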
3.2.3. Model Selection

Our unsupervised candidate generation and rule-based reranking approaches do not require training, but have a number of hyperparameters that can be tuned using the given gold-standard concept annotations. To this end, we use a Bayesian hyperparameter sweep provided through the Weights & Biases platform, based on a Gaussian process model, and optimize the parameters to maximize the F1 score on the training set [34]. The optimal hyperparameters found in this manner are available together with the project's source code [32].
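As an illustration of this model selection step, such a Bayesian sweep over the reranking hyperparameters could be configured with the Weights & Biases Python client roughly as follows. The parameter names and ranges mirror those given in subsubsection 3.2.2, while the project name and the evaluation function are stand-ins.

```python
# Sketch of a Bayesian hyperparameter sweep with Weights & Biases
# (parameter ranges follow subsubsection 3.2.2; evaluation is stubbed).
import wandb

sweep_config = {
    "method": "bayes",  # Gaussian-process-based Bayesian optimization
    "metric": {"name": "f1", "goal": "maximize"},
    "parameters": {
        "w_disorder":                {"min": 0.0, "max": 1.5},
        "w_finding":                 {"min": 0.0, "max": 1.5},
        "w_morphologic_abnormality": {"min": 0.0, "max": 1.5},
        "beta":                      {"min": 0.0, "max": 1.0},
        "w_suppression":             {"min": -1.5, "max": 1.5},
        "w_preferred":               {"min": 0.0, "max": 1.5},
    },
}

def evaluate():
    run = wandb.init()
    cfg = run.config  # sampled values, e.g. cfg["w_disorder"]
    # ... rerank all training mentions with these values ...
    f1 = 0.0          # placeholder: micro-averaged F1 against the gold codes
    run.log({"f1": f1})

sweep_id = wandb.sweep(sweep_config, project="distemist-linking")  # assumed name
wandb.agent(sweep_id, function=evaluate, count=50)
```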
Table 1: Results for subtrack 1 (entities). We report precision (P), recall (R), and F1 score for all four submitted runs on the validation set (partial and strict evaluation) and final performance on the test data. We submitted runs with two different hyperparameter settings, with and without post-processing.

                                      Validation Set                       Test Set
                                      Partial Match       Strict Match     (Submission)
                                      P     R     F1      P     R     F1   P     R     F1
Hyperparameters 1                     .862  .878  .870    .744  .758  .751 .730  .736  .733
Hyperparameters 2                     .867  .865  .866    .746  .745  .745 .730  .726  .728
Hyperparameters 1 + Post-Processing   .864  .878  .871    .756  .770  .763 .743  .748  .746
Hyperparameters 2 + Post-Processing   .870  .865  .867    .758  .756  .757 .742  .737  .739

4. Results

In this section, we share the results of our system in the context of the DisTEMIST shared task.

4.1. Subtrack 1: Entities

The best run of our NER system achieved a micro-averaged F1 score of .7458 on the held-out test dataset, scoring second overall in this subtrack. Results for all NER runs are given in Table 1. We submitted the results for two different hyperparameter settings that achieved the highest scores on the development set. These settings mainly differ by optimization settings, such as the learning rate schedule.

Following prior work on biomedical entity linking [35, 24], we internally use the neleval tool [36] to compute metrics in a strict and a loose evaluation setting, where the loose setting allows for partial matches weighted by the amount of overlap with the ground truth. The metrics computed in the strict setting are identical to the micro-averaged scores used for evaluation in the DisTEMIST shared task. We note that the impact of post-processing is very small when allowing for partial matches, as its role is mainly a correction of entity boundaries. With strict evaluation, post-processing improves F1 scores by 1.1–1.3 pp. Furthermore, we observe that our performance estimate on the validation set is slightly too optimistic when compared to the test set results. Although we did not test the statistical significance of the performance differences, our estimate of the overall ranking of the four submitted runs is consistent with test set performance.

Table 2: Isolated entity linking results with given gold-standard entity mentions. We report precision (P), recall (R), and F1 score for all five submitted runs on the training and validation sets. The training set has not been used to train the entity linking models, but was used for model selection only. Results for the complete system including lookup of codes in the training data are not reported for the training set, as the performance would be (trivially) perfect.

                                           Training Set        Validation Set
                                           (Gold Entities)     (Gold Entities)
                                           P     R     F1      P     R     F1
n-gram TF-IDF (DisTEMIST gazetteer)        .490  .463  .476    .428  .397  .412
SapBERT (DisTEMIST gazetteer + UMLS)       .474  .454  .464    .457  .434  .445
Ensemble                                   .590  .493  .537    .563  .457  .504
Ensemble + Reranking                       .667  .558  .608    .659  .534  .590
Ensemble + Reranking + Training Lookup     -     -     -       .766  .625  .688

4.2. Subtrack 2: Linking

The best run of the complete system for entity linking achieved a micro-averaged F1 score of .566, which is the best performance across all submissions for DisTEMIST subtrack 2.

4.2.1. Isolated Entity Linking Performance

To understand the impact of our system's components, we report the isolated entity linking performance, i.e., with given ground truth entity mentions from the DisTEMIST training data, separately in Table 2. Both candidate generation approaches achieve similar performance, which is notably improved by merging their predictions in an ensemble. Subsequent reranking and in particular the lookup in the training set improve performance on the validation set by more than 18 pp. in total over the plain ensemble, highlighting the importance of task-specific reranking. Although the hyperparameters used for the ensemble and candidate reranking have been determined by optimization on the training set alone, the performance decrease on the validation set is small, indicating good generalizability of our approach to unseen data.

4.2.2. Overall System Performance

For the DisTEMIST shared task, both mention detection and entity linking had to be addressed, meaning that errors during mention detection also impact entity linking performance. The results for the combined system are shown in Table 3. The large drops in F1 score compared to the results from Table 2 (up to 12.5 pp. in the strict evaluation setting for the best-performing run) are expected due to imperfect mention detection. Throughout our participation in the shared task, we used the strict evaluation metrics on the validation set (Table 3, column 6) as a proxy for performance on unseen data. Indeed, these values are very close to the final performance on the test set, with differences of < 0.4 pp. in precision, recall, and F1 score for the best-performing run.

Table 3: Combined entity linking results with entity mentions predicted by the NER pipeline. We report precision (P), recall (R), and F1 score for all five submitted runs on the validation set (partial and strict evaluation) as well as final test set performance.

                                           Validation Set (Predicted Entities)     Test Set
                                           Partial Match       Strict Match        (Submission)
                                           P     R     F1      P     R     F1      P     R     F1
n-gram TF-IDF (DisTEMIST gazetteer)        .369  .350  .359    .339  .322  .330    .358  .365  .361
SapBERT (DisTEMIST gazetteer + UMLS)       .393  .379  .386    .365  .353  .359    .364  .374  .369
Ensemble                                   .481  .398  .435    .447  .374  .408    .468  .389  .425
Ensemble + Reranking                       .566  .470  .513    .529  .443  .482    .543  .451  .493
Ensemble + Reranking + Training Lookup     .653  .544  .593    .617  .517  .563    .621  .520  .566

5. Evaluation and Discussion

In this section, we discuss our findings and point to potential improvements of the system.

5.1. Candidate Generation Performance

As the evaluation metrics for the DisTEMIST shared task consider only a single concept, the system is optimized to achieve a high F1 score for the first candidate, thus favoring a rather aggressive suppression of candidates with lower scores.
Indeed, the combination of ensembling with reranking drastically improves precision, as shown in Table 2 and Table 3. To understand the performance of our system, it is insightful to consider metrics for different values of k, as shown in Figure 2. Ensembling, reranking, and training set lookup improve precision for all values of k. Surprisingly, the precision of SapBERT-based candidate generation is always higher than that of the TF-IDF-based approach, although the latter focuses on a presumably more relevant subset of terms given by the DisTEMIST gazetteer. In terms of recall, the ensemble only slightly outperforms the individual linkers at k = 1, with only small gains in recall for k > 1. In contrast, the recall of the SapBERT model improves steadily with increasing k. However, our candidate generators fail to retrieve all relevant concepts even for large values of k, with recall for k = 100 only reaching .780 for SapBERT and even lower values after the ensembling step due to the application of thresholds. Therefore, we consider increasing the recall during candidate generation an important direction for improving overall system performance. Prior work has proposed strategies to achieve such improvements [37], recover from poor candidate generation [23], or skip candidate generation altogether [35].

Figure 2: Precision, recall, and F1 scores measured on the validation set for different values of k (number of candidates).

5.2. Choice of Candidate Generators and Dictionaries

The choice of candidate generators and particularly the underlying dictionaries are likely to have a large impact on system performance. We did not explore this dimension in much detail and opted for only two candidate generators: one that is based on a very focused dictionary and matching based on purely morphological features, as well as a cross-lingual approach with a very large dictionary based on (learned) distributional semantics. It will be worthwhile to explore other subsets of dictionaries and different approaches to candidate generation, e.g., BM25, which has been used in previous hybrid entity linking systems [38, 19]. In turn, these might alleviate the rather poor candidate generation performance we described in the previous section. The adaptability of our system to other languages partially depends on the availability of high-quality dictionaries in these languages. While the cross-lingual linker based on SapBERT should yield meaningful results for many languages with few or no synonyms in the UMLS, we hypothesize that it still benefits from the relatively large number of Spanish terms in the UMLS.

5.3. Impact of Reranking

To gain more insights into the effects of our reranking approach, we show the proportion of semantic types for the top predictions in Figure 3. Without reranking, the distribution of semantic types of the ensemble differs considerably from the ground truth: only 75.0% of the predicted codes have semantic type disorder vs. 85.6% in the gold standard, while 14.8% of predicted concepts are of type morphologic abnormality vs. 3.7% in the gold standard. After reranking, the distribution of codes predicted by the system becomes much closer to the true distribution. When checking the system output manually, we regularly noticed ambiguity between concepts of types disorder and morphologic abnormality in SNOMED CT and assume that annotators had a preference for concepts with type disorder.
We therefore consider reranking such ambiguous results to match the specifics of the annotation procedure an essential feature for good system performance. Prior work has demonstrated the positive impact of semantic type prediction on entity linking performance, a component that could also further improve the performance of our system [24].

Figure 3: Distribution of semantic types before and after reranking in comparison with the ground truth (determined on the validation set).

5.4. Effective Use of Training Data

While our approach to entity linking is unsupervised in principle, we make use of the training data for hyperparameter selection and as an additional lookup step at the end of the pipeline. Although the latter results in a substantial performance increase for the DisTEMIST shared task, it is arguably the least generalizable part of the pipeline. Recently, zero-shot and clustering-based approaches for EL have gained popularity, which could potentially make better use of ground truth annotations than we did [19, 17, 23]. However, these approaches usually assume additional information about entities, such as descriptions, which are not provided as part of the shared task, but can be gathered from other resources. In addition, the SapBERT model used in our system can also be fine-tuned on task-specific labeled data, which has been shown to improve performance on some benchmark datasets [18].

6. Conclusion and Outlook

In this work, we gave an overview of our EL system in the context of the DisTEMIST shared task and analyzed how individual components contribute to its performance. While an ensemble of general unsupervised candidate generators configured with task-specific dictionaries provides a solid baseline, adaptations through entity reranking and post-processing are crucial for improving system performance. For future work, we would like to investigate candidate generation approaches that yield a better recall, as well as reranking algorithms whose parameters can be learned from annotated data. In addition, encoding more contextual information to help disambiguate mentions is a natural extension that has been employed in previous work and is straightforward to implement with modern, Transformer-based EL architectures. Although we have treated the problems of NER and EL separately, there is an obvious interaction between these tasks, which could benefit from modelling them jointly [11]. We believe that the findings from the shared task will be of great interest for other language communities with scarce language resources in the clinical domain. To enable reproducibility of our experimental results and future adaptations of our system, we make its source code and configuration available on GitHub [32].

Acknowledgments

This work was partially supported by a grant of the German Federal Ministry of Research and Education (01ZZ1802H).

References

[1] K. Donnelly, SNOMED-CT: The advanced terminology and coding system for eHealth, in: Medical and Care Compunetics 3, number 121 in Studies in Health Technology and Informatics, IOS Press, Amsterdam etc., 2006, pp. 279–290.
[2] M. Lehne, J. Sass, A. Essenwanger, J. Schepers, S. Thun, Why digital medicine depends on interoperability, NPJ Digital Medicine 2 (2019) 1–5.
[3] D. Demner-Fushman, N. Elhadad, C. Friedman, Natural language processing for health-related texts, in: Biomedical Informatics, Springer, 2021, pp. 241–272.
[4] O. Bodenreider, The Unified Medical Language System (UMLS): Integrating biomedical terminology, Nucleic Acids Research 32 (2004) D267–D270.
[5] S. Mohan, D. Li, MedMentions: A large biomedical corpus annotated with UMLS concepts, in: Proceedings of the 2019 Conference on Automated Knowledge Base Construction, Amherst, Massachusetts, USA, 2019.
[6] J. A. Kors, S. Clematide, S. A. Akhondi, E. M. Van Mulligen, D. Rebholz-Schuhmann, A multilingual gold-standard corpus for biomedical concept recognition: The Mantra GSC, Journal of the American Medical Informatics Association 22 (2015) 948–956.
[7] A. Névéol, K. B. Cohen, C. Grouin, T. Hamon, T. Lavergne, L. Kelly, L. Goeuriot, G. Rey, A. Robert, X. Tannier, et al., Clinical information extraction at the CLEF eHealth evaluation lab 2016, in: CEUR Workshop Proceedings, volume 1609, NIH Public Access, 2016, p. 28.
[8] M. Kittner, M. Lamping, D. T. Rieke, J. Götze, B. Bajwa, I. Jelas, G. Rüter, H. Hautow, M. Sänger, M. Habibi, et al., Annotation and initial evaluation of a large annotated German oncological corpus, JAMIA Open 4 (2021) ooab025.
[9] L. Campillos-Llanos, A. Valverde-Mateos, A. Capllonch-Carrión, A. Moreno-Sandoval, A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine, BMC Medical Informatics and Decision Making 21 (2021) 1–19.
[10] A. Névéol, H. Dalianis, S. Velupillai, G. Savova, P. Zweigenbaum, Clinical natural language processing in languages other than English: Opportunities and challenges, Journal of Biomedical Semantics 9 (2018) 1–13.
[11] Ö. Sevgili, A. Shelmanov, M. Arkhipov, A. Panchenko, C. Biemann, Neural entity linking: A survey of models based on deep learning, Semantic Web Preprint (2022) 1–44. Publisher: IOS Press.
[12] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, C. G. Chute, Mayo clinical text analysis and knowledge extraction system (cTAKES): Architecture, component evaluation and applications, Journal of the American Medical Informatics Association 17 (2010) 507–513.
[13] L. Soldaini, N. Goharian, QuickUMLS: A fast, unsupervised approach for medical concept extraction, in: MedIR Workshop, SIGIR, 2016, pp. 1–4.
[14] R. Leaman, Z. Lu, TaggerOne: Joint named entity recognition and normalization with semi-Markov models, Bioinformatics 32 (2016) 2839–2846.
[15] D. Demner-Fushman, W. J. Rogers, A. R. Aronson, MetaMap Lite: An evaluation of a new Java implementation of MetaMap, Journal of the American Medical Informatics Association 24 (2017) 841–844.
[16] M. Neumann, D. King, I. Beltagy, W. Ammar, scispaCy: Fast and robust models for biomedical natural language processing, in: Proceedings of the 18th BioNLP Workshop and Shared Task, Association for Computational Linguistics, Florence, Italy, 2019, pp. 319–327.
[17] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettlemoyer, Scalable zero-shot entity linking with dense entity retrieval, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6397–6407.
[18] F. Liu, E. Shareghi, Z. Meng, M. Basaldella, N. Collier, Self-alignment pretraining for biomedical entity representations, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 4228–4238.
[19] L. Logeswaran, M.-W. Chang, K. Lee, K. Toutanova, J. Devlin, H. Lee, Zero-shot entity linking by reading entity descriptions, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 3449–3460.
[20] M. Sung, H. Jeon, J. Lee, J. Kang, Biomedical entity representations with synonym marginalization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 3641–3650.
[21] D. Xu, Z. Zhang, S. Bethard, A generate-and-rank framework with semantic type regularization for biomedical concept normalization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8452–8464.
[22] S. Mohan, R. Angell, N. Monath, A. McCallum, Low resource recognition and linking of biomedical concepts from a large ontology, in: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 2021, pp. 1–10.
[23] R. Angell, N. Monath, S. Mohan, N. Yadav, A. McCallum, Clustering-based inference for biomedical entity linking, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2598–2608.
[24] S. Vashishth, D. Newman-Griffis, R. Joshi, R. Dutt, C. P. Rosé, Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets, Journal of Biomedical Informatics 121 (2021) 103880.
[25] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: Results, methods, evaluation and multilingual resources, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2022.
[26] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: EMNLP 2020 — Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Systems Demonstrations, Association for Computational Linguistics (ACL), 2020, pp. 38–45.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[28] C. P. Carrino, J. Armengol-Estapé, A. Gutiérrez-Fandiño, J. Llop-Palao, M. Pàmies, A. Gonzalez-Agirre, M. Villegas, Biomedical and clinical language models for Spanish: On the benefits of domain-specific pretraining in a mid-resource scenario, arXiv preprint arXiv:2109.03570 (2021).
[29] O. Yadan, Hydra - a framework for elegantly configuring complex applications, GitHub, 2019. https://github.com/facebookresearch/hydra (Last accessed: June 30th, 2022).
[30] F. Borchert, C. Lohr, L. Modersohn, J. Witt, T. Langer, M. Follmann, M. Gietzelt, B. Arnrich, U. Hahn, M.-P. Schapranow, GGPONC 2.0 - the German clinical guideline corpus for oncology: Curation workflow, annotation policy, baseline NER taggers, in: Proceedings of the Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 3650–3660.
[31] I. Montani, M. Honnibal, S. V. Landeghem, A. Boyd, H. Peters, M. Samsonov, J. Geovedi, P. O. McCann, J. Regan, G. Orosz, D. Altinok, S. L. Kristiansen, R. Roman, L. Fiedler, G. Howard, W. Phatthiyaphaibun, Y. Tamura, E. Bot, S. Bozek, M. Murat, M. Amery, B. Böing, P. K. Tippa, L. U. Vogelsang, R. Balakrishnan, V. Mazaev, G. Dubbin, J. Fukumaru, W. Henry, explosion/spaCy: v3.1.0: new pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more, 2021. https://doi.org/10.5281/zenodo.5079800 (Last accessed: June 30th, 2022).
[32] HPI-DHC, DisTEMIST experiment repository, 2022. https://github.com/hpi-dhc/distemist_bioasq_2022 (Last accessed: June 30th, 2022). DOI: 10.5281/zenodo.6783395.
[33] F. Liu, I. Vulić, A. Korhonen, N. Collier, Learning domain-specialised representations for cross-lingual biomedical entity linking, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp. 565–574.
[34] L. Biewald, Experiment tracking with Weights and Biases, 2020. https://www.wandb.com/ (Last accessed: June 30th, 2022).
[35] R. Bhowmik, K. Stratos, G. de Melo, Fast and effective biomedical entity linking using a dual encoder, in: Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, Association for Computational Linguistics, Online, 2021, pp. 28–37.
[36] J. Nothman, B. Hachey, W. Radford, neleval, 2018. https://github.com/wikilinks/neleval (Last accessed: June 30th, 2022).
[37] S. Zhou, S. Rijhwani, J. Wieting, J. Carbonell, G. Neubig, Improving candidate generation for low-resource cross-lingual entity linking, Transactions of the Association for Computational Linguistics 8 (2020) 109–124.
[38] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Now Publishers Inc, 2009.