Overview of MultiCardioNER Task at BioASQ 2024 on Medical Specialty and Language Adaptation of Clinical NER Systems for Spanish, English and Italian Salvador Lima-López1,* , Eulàlia Farré-Maduell1 , Jan Rodríguez-Miret1 , Miguel Rodríguez-Ortega1 , Livia Lilli2,3 , Jacopo Lenkowicz2 , Giovanna Ceroni4,5 , Jonathan Kossoff5 , Anoop Shah4,5 , Anastasios Nentidis6,7 , Anastasia Krithara5 , Georgios Katsimpras5 , Georgios Paliouras5 and Martin Krallinger1 1 Barcelona Supercomputing Center, Plaça Eusebi Güell, 1-3, 08034 Barcelona, Spain 2 Real World Data Facility, Gemelli Generator, Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, 00168, Italy 3 Catholic University of the Sacred Heart, Rome, 00168, Italy 4 University College London, UK 5 University College London Hospitals NHS Foundation Trust, UK 6 National Center for Scientific Research “Demokritos”, Athens, Greece 7 Aristotle University of Thessaloniki, Thessaloniki, Greece Abstract Transformers and large language models (LLMs) are increasingly used for clinical data analysis, mostly in English, but also in many other languages used within medical care systems. To comply with clinical standards, it is critical to evaluate the generated results by means of benchmarking efforts based on high-quality manually annotated corpora. To foster the adaptation of general clinical natural language processing (NLP) components to the characteristics of medical specialties, as well as exploring cross-language adaptation techniques, we propose the MultiCardioNER task at BioASQ 2024. MultiCardioNER focuses on the adaptation of named entity recognition (NER) systems trained on multispecialty clinical case reports to cardiology, since cardiovascular diseases are the leading cause of death globally. The MultiCardioNER task covered two entity types (diseases and medications) in case reports written in three languages (Spanish, English and Italian). To generate a comparable Gold Standard clinical NER corpus, we used neural translation, annotation projection and manual annotation correction by domain experts. Top scoring teams reached very competitive results for disease (F1-score 0.8199) and medication mentions (0.9277) in Spanish and also obtained very competitive scores for English (F1-score 0.9223) and Italian (F1-score 0.8842). These results suggest that adaptation of general clinical NLP components to a specific clinical specialty can improve the overall results and that cross-language adaptation of clinical NLP components using neural translation and expert-in-the-loop annotation might speed up the implementation of clinical entity extraction systems. The MultiCardioNER corpora, as well as a silver standard made up of predictions of participating systems over the background set, are available at: https://zenodo.org/records/11368861. Keywords named entity recognition, cardiology, subdomain adaptation, multilingual, clinical NLP 1. Introduction Cardiovascular diseases (CVDs) represent a leading cause of death and morbidity worldwide and, therefore, are responsible for considerable disability costs every year. Cardiology is a medical field with CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France * Corresponding author. $ salvador.limalopez@bsc.es (S. Lima-López); eulalia.farre@bsc.es (E. Farré-Maduell); jan.rodriguez@bsc.es (J. Rodríguez-Miret); livia.lilli@policlinicogemelli.it (L. Lilli); jacopo.lenkowicz@policlinicogemelli.it (J. Lenkowicz); g.ceroni@ucl.ac.uk (G. Ceroni); j.kossoff@nhs.net (J. Kossoff); a.shah@ucl.ac.uk (A. Shah); tasosnent@iit.demokritos.gr (A. Nentidis); akrithara@iit.demokritos.gr (A. Krithara); gkatsibras@iit.demokritos.gr (G. Katsimpras); paliourg@iit.demokritos.gr (G. Paliouras); mkrallin@bsc.es (M. Krallinger)  0000-0002-7384-1877 (S. Lima-López); 0009-0000-0793-981X (J. Rodríguez-Miret); 0009-0000-0188-079X (M. Rodríguez-Ortega); 0009-0005-3319-7211 (L. Lilli); 0000-0002-8366-1474 (J. Lenkowicz); 0000-0002-8907-5724 (A. Shah); 0000-0002-3782-4412 (A. Nentidis); 0000-0003-0491-4507 (A. Krithara); 0000-0003-3697-941X (G. Katsimpras); 0000-0001-9629-2367 (G. Paliouras); 0000-0002-2646-8782 (M. Krallinger) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings its own concepts and expressions of great relevance, as cardiovascular diseases are highly prevalent worldwide. Analysis of unstructured medical data, such as clinical notes or medical publications, may provide an opportunity to improve the characterization of cardiac pathologies. The extraction of clinical variables from medical content is key to enabling healthcare data analytics, improving patient care and advancing precision medicine. Information contained only as free text within Electronic Medical Records is currently mostly unused due to the difficulty of extracting relevant data from very diversely written data sources. Additionally, the distinctive language of each medical specialty calls for more specialized automatic semantic annotation resources in English and any language used in clinical services. Due to its importance and the need to improve the extraction, use, and ultimately exploitation of patient data suffering from cardiovascular conditions, efforts have been made to implement natural language processing (NLP) solutions to classify or extract key variables from cardiology clinical content. Using results generated by NLP technologies might contribute to improving outcomes and understanding disease in cardiology. In order to account for the diversity and heterogeneity of NLP research applied to cardiology, several review articles have tried to systematically characterize the various NLP application scenarios adapted to handle cardiovascular disease medical documents [1]. These included applications related to heart failure [2], coronary artery disease, general cardiology or valvular heart disease [3, 4]. Efforts were also made to extract, by means of NLP tools, symptoms [5, 6], vital signs [7], heart function measurements [8], risk factors [4], or cardiovascular comorbidities [9] as well as social risk factors [10] or diagnostic codes of common cardiovascular diseases [11]. Some other attempts were also made to explore the use of NLP approaches to extract Framingham criteria [12] or New York Heart Association classifications from unstructured clinical notes [13, 14, 15]. General clinical domain pre-training does not necessarily transfer well to all medical sub-specialties or disciplines because of the use of highly specialized medical language, as encountered in cardiology clinical case reports or cardiology clinical notes. Domain adaptation strategies may have a great potential to improve NLP solutions for practical settings, real-world scenarios and industrial applications [16]. Also, adaptation of clinical NLP solutions across languages other than English is necessary and requires collaboration between researchers to accelerate progress in non-English clinical NLP [17]. In the case of Spanish, general clinical NLP datasets and resources, such as the DisTEMIST, SympTEMIST, PharmaCoNER, and MedProcNER corpora and systems, have been released. How- ever, (a) the interplay and complementarity of multi-label entity extraction approaches were neither targeted nor evaluated, and (b) how such approaches could be adapted to handle multiple languages was not tested. To address these issues and promote the development of comparable clinical NLP components adapted to a specific clinical domain across several languages, we have organized the MultiCardioNER shared task. This paper presents an overview of the data, methodologies and results of MultiCardioNER. It is structured as follows: Section 2 introduces the shared task, including its sub-tasks and evaluation methods. Next, Section 3 describes the different corpora used as part of MultiCardioNER, namely DisTEMIST, DrugTEMIST and CardioCCC, as well as other associated resources, while Section 4 presents the participation results and proposed methodologies. Finally, Section 5 concludes the paper with a discussion of some of the most interesting aspects, learned lessons, future work and more. 2. Task description 2.1. Shared task description The MultiCardioNER task participants were asked to implement named entity recognition (NER) systems using a general clinical corpus annotated with disease and medication mentions. They were then required to adapt these NER systems to a particular medical specialty, namely cardiology. In addition, the MultiCardioNER task also explored the creation of clinical multilingual NER components or cross-language adaptation of these systems for three languages: Spanish, English and Italian. The MultiCardioNER task relied on a previous resource exploited in a past shared task, called DisTEMIST [18]. The DisTEMIST corpus is a collection of 1,000 clinical case reports covering a wide range of specialties annotated for diseases by clinical experts. Furthermore, a previously unreleased corpus called DrugTEMIST was published as part of the task. The DrugTEMIST corpus provides drug or medication mention annotations for the same collection of clinical case reports as used for the DisTEMIST dataset. For the adaptation to cardiology, we have constructed the CardioCCC corpus, a new dataset that consists of manually selected cardiology clinical case reports showing similar characteristics as cardi- ology discharge summaries. This resource was provided to participants to enable the exploration of different clinical subdomain/specialty adaptation strategies and to benchmark the resulting systems. To foster the generation of multilingual clinical NER corpora, DrugTEMIST and CardioCCC were automatically translated from Spanish into both English and Italian. The Gold Standard drug mention annotations were then mapped into both target languages and validated manually by clinical experts (native speakers of English and Italian). The three corpora and the underlying annotation projection process are described in more detail in Section 3. The evaluation process relied on the comparison of participating team predictions against the manual annotations previously done by the clinical experts. Each team was allowed to submit up to 5 runs for each subtrack and language. The evaluation process and metrics are reported in Section 2.3. 2.2. Subtracks MultiCardioNER was structured into two different subtracks: • Subtrack 1 (CardioDis). This track focuses on the adaptation of disease recognition systems to the cardiology specialty in Spanish. Participants could use the DisTEMIST corpus [18] as a base training set, together with a new collection of cardiology-specific clinical case reports annotated with diseases (CardioCCC) that could be used to fine-tune or adapt their systems to cardiology case reports. • Subtrack 2 (MultiDrug). This subtrack focuses on the multilingual or cross-language adaptation (Spanish, English and Italian) of medication recognition systems, specifically for cardiology clinical case reports. For this track, participants could use the DrugTEMIST dataset as NER training resource. This corpus can be seen as a complementary dataset to the previously-released DisTEMIST, ProcTEMIST and SympTEMIST corpora, as it incorporates annotations of medications for the same document collection. To enable adaptation to cardiology, the CardioCCC corpus was annotated with medication mentions and divided into development and test subsets. While the original versions of both datasets were created using Spanish texts, a machine-translated version in English and Italian was revised by hand and annotated by clinical experts. 2.3. Evaluation The task was divided into distinct phases: training and test set prediction (evaluation). During the training phase, participants were provided with the DisTEMIST and DrugTEMIST datasets, as well as a subset of the CardioCCC corpus made up of 258 documents. The second batch of the CardioCCC collection was used as test set and released together with a larger background set to make sure that no manual post-editing was carried out by the teams and that the submitted systems could scale up to process larger data collections. These collections (test and background set) were released approximately one month after the start of the training phase. Participants were given two weeks to generate predictions for all documents. They were then evaluated using the Gold Standard annotations of the CardioCCC test set, reserving the predictions for the background set to create a participants’ Silver Standard (discussed in Section 3.4). It was not mandatory to submit results for all three languages. Both MultiCardioNER subtracks were evaluated using micro-averaged precision, recall and F1-score. These metrics are calculated as follows: Table 1 Results of the baseline system (vocabulary transfer) for the two MultiCardioNER subtracks Subtrack Language System Name Precision Recall F1 CardioDis Spanish DisTEMIST vocabulary transfer 0.5178 0.3681 0.4303 MultiDrug Spanish DrugTEMIST vocabulary transfer 0.6366 0.7148 0.6734 MultiDrug English DrugTEMIST vocabulary transfer 0.3317 0.7269 0.4556 MultiDrug Italian DrugTEMIST vocabulary transfer 0.3320 0.6844 0.4471 True Positives Precision (P) = True Positives + False Positives True Positives Recall (R) = True Positives + False Negatives 2 * (𝑃 * 𝑅) F1 score (F1) = (𝑃 + 𝑅) As part of the task, an official MultiCardioNER evaluation library was released and is available on GitHub1 . After the task results were released, the test set Gold Standard annotations were shared with participating teams to enable them to perform extra experiments and facilitate error analysis of their systems. 2.4. Baseline To provide a baseline system for comparison, we used a simple vocabulary transfer approach that relied on generating a gazetteer of entities from the training sets (DisTEMIST/DrugTEMIST corpora), and carrying out dictionary look-up of these terms in the test set. Specifically, the system is a lexical lookup approach that tries to find the annotated strings in both corpora within the cardiology test set. The baseline results are shown in Table 1. 3. Corpus and resources The MultiCardioNER task leverages an already-existing corpus, DisTEMIST, as well as two new releases, DrugTEMIST and CardioCCC. DisTEMIST and DrugTEMIST share the same document collection, which consists of clinical case reports from various clinical specialties such as oncology, infectious diseases, urology and psychiatry. This collection of texts has also been used for the procedures corpus MedProcNER/ProcTEMIST [19] and the signs and symptoms corpus SympTEMIST [20]. These corpora could be considered complementary since they have been annotated by the same clinical experts using the same methodology, which includes the creation of dedicated annotation guidelines. They were released as part of previous shared tasks in an effort to promote the development and accessibility of annotated resources for clinical information extraction in Spanish validated by clinical experts. Other resources resulting from this initiative include PharmaCoNER [21], LivingNER [22], MEDDOPROF [23] or MEDDOPLACE [24]. The CardioCCC corpus consists of cardiology-specific clinical case reports. It includes annotations for diseases and drugs created using the same guidelines as DisTEMIST and DrugTEMIST. Although all three corpora were created originally in Spanish, the texts and annotations related to drugs were translated into English and Italian and released for this task. Table 2 provides some statistics for the different datasets that make up MultiCardioNER, which are explained in detail in this section. All datasets described in this section are openly available on Zenodo2 . 1 https://github.com/nlp4bia-bsc/multicardioner_evaluation_library 2 https://zenodo.org/doi/10.5281/zenodo.10948354 Table 2 Statistics for the datasets provided for MultiCardioNER. “Annot.” stands for “annotations”, while “Chars” stands for “characters”. Unique annotations refer to the number of distinct annotated strings after converting all annotations to lowercase. The number of tokens has been calculated using the following spaCy models: “es_core_news_sm”, “en_core_web_sm” and “it_core_news_sm”. Mean Mean Unique Dataset Lang. Entity Docs Tokens Chars Annot. Annot. Annot. Annot. Tokens Chars DisTEMIST ES Diseases 1,000 406,137 2,335,968 10,664 6,739 3.20 ± 2.98 24.76 ± 18.89 DrugTEMIST ES Drugs 1,000 406,137 2,335,968 2,778 925 1.19 ± 0.56 11.34 ± 4.46 EN Drugs 1,000 404,194 2,230,631 2,814 875 1.25 ± 0.66 11.26 ± 0.52 IT Drugs 1,000 421,251 2,393,002 2,808 893 1.25 ± 0.69 11.49 ± 4.73 CardioCCC ES Diseases 508 568,297 3,215,774 18,232 7,692 3.32 ± 2.84 26.28 ± 19.06 ES Drugs 508 568,297 3,215,774 4,227 755 1.19 ± 0.71 11.60 ± 5.25 EN Drugs 508 576,772 3,114,833 4,231 734 1.21 ± 0.64 11.37 ± 4.74 IT Drugs 508 595,332 3,345,466 4,385 752 1.23 ± 0.72 11.85 ± 5.25 Figure 1: Excerpt from the DisTEMIST corpus with various annotated diseases. Translation with annotated entities in italics: “A 37-year-old woman diagnosed with AML (acute myeloblastic leukemia) in 2003 following a spontaneous right hemopneumothorax that required surgery with evacuation of the hemothorax and resection of bullous dystrophy. She was followed up on an outpatient basis without incident until 2009 when he presented with chylous ascites and a large retroperitoneal cystic lymphangioma was detected on an abdominal computed tomography (CT) scan. In February 2011 she was admitted for exertional dyspnea and extensive right pleural effusion. Pleural fluid showed characteristics of chylothorax: [...]”. 3.1. DisTEMIST DisTEMIST is a Gold Standard manually annotated corpus of disease mentions in Spanish clinical case documents normalized or mapped to SNOMED CT concept identifiers. It consists of 1,000 clinical case reports written in Spanish from miscellaneous medical specialties. Figure 1 shows an example of an annotated document. The texts in the corpus were derived from SciELO (Scientific Electronic Library Online)3 , an electronic library that contains publications from scientific journals. The texts were manually selected by clinical experts so that their structure and content were clinically relevant and representative. The texts were then pre-processed to extract the appropriate sections of the clinical cases and to remove embedded figure references and citations to be as close as possible to real medical records. These texts were originally released under the name SpaCCC (Spanish Clinical Case Corpus). As shown in Table 2, the text collection includes a total of 406,137 tokens and 2,335,968 characters. In terms of annotations, the corpus includes a total of 10,664 entities, out of which 6,739 are unique after converting them into lowercase. The DisTEMIST corpus was annotated and standardized by two clinical experts from a Spanish 3 http://www.scielo.org tertiary hospital. The annotated mentions and their normalization were post-processed and revised afterwards by a third physician. The annotations were created using the brat tool [25]. Annotation and normalization guidelines were created specifically for this task. The annotation involved discussions between physicians, particularly regarding complex mentions. This, together with multiple rounds of inter-annotator agreement (IAA) through parallel annotation of a section of the corpus (around 20%), resulted in an iterative refinement of the guidelines. After several rounds, a total IAA score of 82.3 (computed as the pairwise agreement between two independent annotators) for the disease mentions was achieved. The result of this process is the DisTEMIST guidelines, openly available on Zenodo4 . The document contains a total of 28 pages describing how to annotate diseases in clinical texts. There are a total of 52 rules divided into various types, such as general, positive or negative. There is also a set of rules specific to oncology mentions that was added as the language used in clinical cases related to this specialty proved to be more specific and harder to annotate. These are partially based on the CANTEMIST corpus [26]. The guidelines also include a discussion of the task’s importance, a corpus characterization, basic information about the task and the annotation process, as well as indications and resources for the annotators. It is noteworthy that the DisTEMIST guidelines have been adapted to other domains, such as social media [27]. The DisTEMIST text documents are in plain text format with UTF-8 encoding. The annotations are presented in two different stand-off versions. The first version includes the original annotation files as outputted by brat [25]. These are .ann files, one for each text file, where each line represents an annotation, including its label, its start and end position and its associated text. The second version is a single tab-separated file (.tsv) which includes all annotations in the corpus. Similarly to the .ann files, this version includes one annotation per row with an additional field for the corresponding filename. For MultiCardioNER, all 1,000 documents in the corpus are presented together. For anyone who wishes to use the original train/test split of the corpus (consisting of 750 and 250 documents, respectively), we advise downloading the original DisTEMIST Gold Standard5 to retrieve the list of filenames belonging to each split. The original repository also includes the SNOMED CT mappings for the annotated mentions, as well as some additional data, such as a background set of related clinical documents and a Silver Standard of the corpus in 6 languages (English, Portuguese, Catalan, Italian, French and Romanian), created using annotation projection. The annotation projection methodology is described in Section 3.2, as well as in the original paper [18]. 3.2. DrugTEMIST DrugTEMIST is a collection of 1,000 clinical case reports from various clinical specialties annotated with mentions of medications. Figure 2 shows an excerpt of an annotated document from the corpus. The corpus uses the same collection of texts as DisTEMIST, which is also shared by MedProcN- ER/ProcTEMIST [19] and SympTEMIST [20]. Unlike those corpora, DrugTEMIST hadn’t been previously released and is one of the novelties of the MultiCardioNER task. Again, the corpus includes a total of 406,137 tokens and 2,335,968 characters, as well as 2,778 annotated entities (925 unique after converting them to lowercase). Similarly to the DisTEMIST corpus, dedicated annotation guidelines were written to define what should be considered a medication and how to perform the annotations. These guidelines were created and refined using the same methodology used for DisTEMIST, including thorough discussions between physicians and the annotation of a sample of the corpus (around 20%). The final IAA of the corpus is 0.955. The DrugTEMIST annotation guidelines are also available in Zenodo6 . They contain 17 pages and are quite similar to the DisTEMIST guidelines, with a total of 29 rules. The release format of the corpus is the same as that of DisTEMIST. 4 https://zenodo.org/doi/10.5281/zenodo.6458078 5 https://zenodo.org/doi/10.5281/zenodo.6408476 6 https://zenodo.org/doi/10.5281/zenodo.11065432 Figure 2: Excerpt from the DrugTEMIST corpus with various annotated medications. Translation with annotated entities in italics: “An 82-year-old woman with a history of breast neoplasia treated with surgery and hormone therapy 20 years ago, hypertensive cardiomyopathy in sinus rhythm, hypercholesterolemia and moderate chronic hyponatremia around 133 mmol/L. She was treated with torasemide 5 mg/24h, isosorbide mononitrate 50 mg/24h, acetylsalicylic acid 100 mg/24h, pravastatin 20 mg/24h, candesartan 32 mg/24h, hydrochlorothiazide 12.5 mg/24h, atenolol 50 mg/24h and spironolactone 25 mg/24h”. The original Gold Standard of the corpus was created in Spanish. For the multilingual part of the task, we created versions of the corpus in English and Italian using annotation projection techniques. These two languages were chosen due to their relevance for other related projects and the availability of clinical experts fluent in each language, who performed a manual revision of all documents to validate the annotation and the quality of the translation. Specifically, the annotation projection methodology consisted of the following steps: 1. An automatic translation of the Spanish documents was carried out (for the previous DisTEMIST task) using high-quality commercial machine translation systems. In a separate step, the Gold Standard annotations were translated without context (i.e. as a plain list of strings). 2. The translated annotations were next transferred into each document using a look-up system. For each document, only the annotations that existed in the original Gold Standard were looked up to prevent introducing false positives. The result of this step is an automatically annotated version of the corpus in each language, which could be considered a Silver Standard. 3. In order for the corpus to be used as a Gold Standard, a manual revision was performed. Experts compared the original Spanish version of the documents with the version in English and Italian using brat’s side-by-side comparison mode. They were tasked with correcting existing and adding new mentions if necessary to make the annotation as close as possible to the original. Additionally, these experts were asked to provide alternative translations to annotated entities that were incorrectly translated. 4. A post-processing step incorporated the alternative translations suggested by the annotators. These translations replaced the original annotated entity both in the text and in the annotation files. Statistics about the English and Italian versions of DrugTEMIST are also provided in Table 2. We should underscore that the different versions of the corpus do not contain the exact same number of annotations. This is mostly due to translation differences and errors introduced by the machine translation system. 3.3. CardioCCC CardioCCC (which stands for Cardiology Clinical Case Corpus) is a collection of 508 cardiology clinical case reports. The documents were retrieved from open-access cardiology journals in Spanish. Within these journals, we tried to manually locate clinical case reports that would have a similar structure to real clinical health records. The candidates were then extracted and, in a similar fashion to the other two corpora presented so far, pre-processed to keep only the relevant article sections and to remove references to figures and tables. The cases were then revised by a clinical expert to confirm their validity. Figure 3 shows a parallel example of the drug annotations in Spanish, English and Italian. (a) Example in English. (b) Example in Spanish. (c) Example in Italian. Figure 3: Excerpt from the CardioCCC drug annotations in all three languages taken from the same document. Table 3 Statistics for the two splits of the CardioCCC corpus. CardioCCC_dev refers to the first batch of the corpus, which participants were allowed to use freely during the training phase. CardioCCC_test refers to the held-out test set used for evaluation. “Annot.” stands for “annotations”, while “Chars” stands for “characters”. Unique annotations refer to the number of distinct annotated strings after converting all annotations to lowercase. The number of tokens has been calculated using the following spaCy models: “es_core_news_sm”, “en_core_web_sm” and “it_core_news_sm”. Mean Mean Unique Dataset Lang. Entity Docs Tokens Chars Annot. Annot. Annot. Annot. Tokens Chars CardioCCC_dev ES Diseases 258 315,047 1,777,279 10,348 4,749 3.34 ± 2.72 26.67 ± 18.47 ES Drugs 258 315,047 1,777,279 2,510 526 1.21 ± 0.74 11.67 ± 5.34 EN Drugs 258 319,289 1,718,585 2,510 513 1.22 ± 0.66 11.48 ± 4.85 IT Drugs 258 330,291 1,848,888 2,585 520 1.23 ± 0.73 11.82 ± 5.31 CardioCCC_test ES Diseases 250 253,250 1,438,495 7,884 3,784 3.29 ± 3.00 25.76 ± 19.80 ES Drugs 250 253,250 1,438,495 1,718 465 1.17 ± 0.66 11.51 ± 5.12 EN Drugs 250 257,483 1,396,248 1,721 460 1.20 ± 0.61 11.21 ± 4.58 IT Drugs 250 265,041 1,496,578 1,800 469 1.22 ± 0.71 11.89 ± 5.16 The corpus contains annotations for diseases and drugs, which were created following the same guidelines used for DisTEMIST and DrugTEMIST. The main annotator for CardioCCC was the same clinical expert who did the final annotation and revision step for the other two corpora, which was a big asset in accelerating the corpus annotation process. As with DrugTEMIST, the corpus’s texts were translated from Spanish into English and Italian using machine translation. The Gold Standard drug annotations were also transferred into English and Italian via annotation projection and revised by clinical experts who are native speakers of each language. As explained in Section 2.3, CardioCCC was released in two batches: one for training/development and another for evaluation. The statistics for these two parts are presented in Table 3, while Table 2 presents the statistics of the complete corpus. In terms of content, as shown by Table 2, the corpus contains 568,297 tokens and 3,215,774 characters. Despite having about half the documents as the SpaCCC corpus (i.e. the texts in DisTEMIST/DrugTEMIST), CardioCCC contains over 150,000 more tokens and one million more characters, meaning the documents are quite longer. This is also reflected in the number of annotations, with CardioCCC having around 8,000 more annotated diseases and 1,500 more drugs. Notably, despite the higher total number of annotations, CardioCCC contains fewer unique drug mentions (which is calculated by converting all annotations to lowercase). This might be due to the fact that in CardioCCC, drug mentions are usually more limited to cardiology-specific medications, while in DrugTEMIST, there is a wider variety of medications mentioned due to the varied clinical specialties it contains. As for the length of annotations themselves, all corpora seem to have a similar distribution in terms of character and token length. The high standard deviation with respect to the mean, especially for diseases, indicates that there’s a number of long annotations in the datasets. 3.4. Background set In addition to the three annotated corpora, an additional dataset was released as a background set. This dataset contains 7,625 text documents, both from the cardiology subdomain and other clinical specialties. While most documents were originally written in Spanish, some of them were also originally in English and Italian. All documents were translated to the other languages to have a comparable background set in all three languages. Together with the background set, we release a tab-separated values (.tsv) file that specifies the original language of each document and whether they belong to the cardiology domain or not. As part of the task’s evaluation period, participants were asked to create predictions for diseases and drugs using their systems. Their predictions were then used to create a Silver Standard, which we release in three different versions: 1. All mentions are kept, with the label name reflecting the team and run the prediction belongs to. This version inevitably includes many incorrect and redundant annotations. 2. Only predictions that have some overlap with the predictions of a different run are used. The overlapping annotations are then merged under a single annotation and a new label name. This version should have a reduced number of incorrect annotations, although some of the “correct” annotations might have extension problems, such as being too short or too long. 3. Only predictions that have a complete overlap with another prediction of a different run are used. This should, in theory, contain the highest number of correct annotations. Table 4 shows some basic statistics about the text documents included within the Silver Standard. This new dataset can have multiple uses, such as bootstrapping manual annotations, system training using semi-supervised learning techniques or errors and data analysis, amongst others. Table 4 Statistics of the documents in the background set. Language Documents Tokens Characters Spanish 7,625 3,863,801 22,066,533 English 7,625 3,857,831 21,130,044 Italian 7,625 4,015,920 22,782,246 Table 5 Overview of the teams that participated in MultiCardioNER. In the Affiliation column, A/I stands for academic or industry institution. In the Tasks column, C stands for the CardioDis subtrack and M for the MultiDrug subtrack. Team Name Affiliation Tasks Ref. BIT.UA IEETA, University of Aveiro, Portugal [A] C [28] Technische Universität Wien, Austria & Spanish National Research DataScienceTUW C/M [29] Council (CSIC), Spain [A] Enigma OntoText, Bulgary & Sofia University, Bulgary [I/A] C/M [30] ICUE University of Edinburgh, UK & Imperial College London, UK [A] M [31] NOVALINCS NOVA School of Science And Technology, Portugal [A] C/M [32] PICUSLab Università degli Studi di Napoli Federico II, Italy [A] C [33] Siemens Advanta, Romania & Transilvania University of Brasov, Siemens C/M [34] Romania [I/A] 4. Results 4.1. Participation overview A total of 31 teams registered for the MultiCardioNER task, out of which 7 teams submitted at least one run of their predictions. The participating teams originate from 8 different countries (some include collaborations between teams from different countries), and except for one group from the industry, the rest belong to academia. Table 5 shows the complete list of participating teams, along with their affiliation and the reference to their task paper. As for the participation in each subtrack, 6 teams participated in the CardioDis subtrack, while 5 teams participated in the MultiDrug subtrack (with one of those teams participating only in the Spanish part). Overall, a total of 70 runs were submitted, with each team allowed up to 5 runs per subtrack and language: 20 for the CardioDis subtrack, 18 for the Spanish MultiDrug, 16 for the English MultiDrug and 16 for the Italian MultiDrug. 4.2. System results All in all, the top scores for each subtrack were: • Subtrack CardioDis. The team BIT.UA attained the top position with an ensemble of RoBERTa- based models (roberta-es-clinical-trials-ner) that also uses a multi-head-CRF approach [35]. Their runs integrated the provided datasets in different ways, with the highest scores achieved by the models that use both the DisTEMIST and CardioCCC data. Their best run achieved an F1-score of 0.8199 and a recall of 0.8243. The team with the next best F1-score (0.8049) is Enigma, which uses a CLIN-X-ES model also fine-tuned on the DisTEMIST and CardioCCC data. Interestingly, the team PICUSLab achieves the best precision (0.8886) by a wide margin by combining the predictions of multiple models trained on different parts of the data (including an augmented version of the CardioCCC corpus) and then using string matching techniques to enhance the final predictions. • Subtrack MultiDrug. In Spanish, the best F1-score is achieved by the ICUE team (0.9277), who also achieved the best recall (0.9412). Meanwhile, in English and Italian, the winning team is Enigma, with an F1-score of 0.9223 and 0.8842, respectively. The results for the CardioDis subtrack are shown in Table 6, while the results for the MultiDrug subtrack are presented in Table 7 for Spanish, Table 8 for English and Table 9 for Italian. 4.3. Methodologies This section describes the methodologies used by each team, which are also summarized in Table 10. Table 6 Results of the MultiCardioNER CardioDis subtrack, sorted by F1-score. The best result is bolded, and the second-best is underlined. Team Name Run name Precision Recall F1 BIT.UA run1-all-full 0.8155 0.8243 0.8199 BIT.UA run0-top5-full 0.811 0.8181 0.8145 Enigma 3-system-CLIN-X-ES-pretrained 0.8016 0.8082 0.8049 Enigma 2-system-CLIN-X-ES-14 0.8052 0.8007 0.803 PICUSLab aug_fus_sub2 0.7794 0.803 0.791 BIT.UA run4-all 0.7981 0.7827 0.7903 Enigma 1-system-CLIN-X-ES-12 0.7827 0.7938 0.7882 PICUSLab aug_fus_sub1 0.7346 0.7799 0.7566 BIT.UA run3-all-val 0.7544 0.7588 0.7566 BIT.UA run2-best-val 0.748 0.7542 0.7511 DataScienceTUW run4-roberta-dg 0.6565 0.7376 0.6947 DataScienceTUW run5-roberta-dg-windows 0.6546 0.7244 0.6877 Siemens run1_SDR 0.6758 0.6437 0.6593 PICUSLab aug_fus_sm_sub2 0.8919 0.4897 0.6323 DataScienceTUW run1_mdeberta-ct-mlm-dg 0.5928 0.6715 0.6297 PICUSLab aug_fus_sm_sub1 0.8886 0.4744 0.6185 DataScienceTUW run2-mdeberta-ct 0.5027 0.6884 0.581 DataScienceTUW run3_mdeberta-ct-dg 0.48 0.6773 0.5618 NOVALINCS 1_bsc-bio-ehr-es_distemist_4 0.8018 0.3525 0.4897 NOVALINCS 2_bsc-bio-ehr-es_distemist_1 0.8183 0.3398 0.4802 Table 7 Results of the MultiCardioNER MultiDrug subtrack in Spanish, sorted by F1-score. The best result is bolded, and the second-best is underlined. Team Name Run name Precision Recall F1 ICUE run2_single_pp 0.9146 0.9412 0.9277 ICUE run4_GPT_translation 0.9146 0.9412 0.9277 ICUE run5_GPT_translation_all 0.9146 0.9412 0.9277 Enigma 3-system-SpanishRoBERTa 0.913 0.9348 0.9238 Enigma 1-system-XLMR 0.904 0.9208 0.9123 Enigma 2-system-XLMR-filtering 0.9148 0.9005 0.9076 ICUE run3_single 0.8777 0.9272 0.9018 Siemens run1_SMR 0.8928 0.8778 0.8852 ICUE run1_multilingual_pp 0.8287 0.9348 0.8786 Enigma 5-system-XLMR-filtering-dict2 0.7654 0.8871 0.8218 NOVALINCS 3_bsc-bio-ehr-es_drugtemist_4 0.9242 0.4965 0.646 NOVALINCS 4_bsc-bio-ehr-es_drugtemist_1 0.9076 0.4919 0.638 DataScienceTUW run3_roberta-ct-multilingual 0.8705 0.4342 0.5794 Enigma 4-system-XLMR-filtering-dict1 0.4351 0.7899 0.5611 DataScienceTUW run5_roberta-ct-mlm 0.8421 0.3912 0.5342 DataScienceTUW run4_mdeberta_ct_mlm_dg 0.6815 0.3836 0.4909 DataScienceTUW run2_mdeberta-ct-multilingual 0.7647 0.3556 0.4855 DataScienceTUW run1_mdeberta-multilingual 0.3914 0.1531 0.2201 • Team BIT.UA. For subtrack CardioDis, this team builds on some of their previous work, namely the Multi-Head- CRF approach [35], which introduces a Multi-Head Conditional Random Field (CRF) classifier on top of a multi-class NER system. Starting from the “roberta-es-clinical-trials-ner” pre-trained Table 8 Results of the MultiCardioNER MultiDrug subtrack in English, sorted by F1-score. The best result is bolded, and the second-best is underlined. Team Name Run name Precision Recall F1 Enigma 3-system-BioLinkBERT 0.8981 0.9477 0.9223 ICUE run2_single_pp 0.9086 0.9128 0.9107 ICUE run4_GPT_translation 0.9086 0.9128 0.9107 Enigma 1-system-XLMR 0.8823 0.9233 0.9023 Enigma 2-system-XLMR-filtering 0.9031 0.8989 0.901 Enigma 5-system-XLMR-filtering-dict2 0.8698 0.9047 0.8869 ICUE run3_single 0.8734 0.8977 0.8854 ICUE run1_multilingual_pp 0.8314 0.9343 0.8799 Siemens run1_EMR 0.8685 0.8791 0.8738 Enigma 4-system-XLMR-filtering-dict1 0.8298 0.921 0.873 ICUE run5_GPT_translation_all 0.8767 0.8635 0.87 DataScienceTUW run3_roberta-ct-multilingual 0.8632 0.4364 0.5797 DataScienceTUW run4-mdeberta-windows 0.7955 0.4317 0.5597 DataScienceTUW run5-biobert-mlm-windows 0.6771 0.441 0.5341 DataScienceTUW run2_mdeberta-ct-multilingual 0.8453 0.3777 0.5221 DataScienceTUW run1_mdeberta-multilingual 0.5648 0.2481 0.3448 Table 9 Results of the MultiCardioNER MultiDrug subtrack in Italian, sorted by F1-score. The best result is bolded, and the second-best is underlined. Team Name Run name Precision Recall F1 Enigma 1-system-XLMR 0.884 0.8844 0.8842 Enigma 3-system-Italian-Spanish-RoBERTa 0.8723 0.8956 0.8838 Enigma 2-system-XLMR-filtering 0.9016 0.8606 0.8806 Siemens run1_IMR 0.8891 0.8689 0.8789 ICUE run4_GPT_translation 0.9114 0.8461 0.8776 ICUE run5_GPT_translation_all 0.9114 0.8461 0.8776 ICUE run2_single_pp 0.8186 0.9 0.8574 ICUE run1_multilingual_pp 0.8139 0.8867 0.8487 ICUE run3_single 0.7879 0.8894 0.8356 Enigma 4-system-XLMR-filtering-dict1 0.5693 0.8578 0.6844 Enigma 5-system-XLMR-filtering-dict2 0.5707 0.845 0.6813 DataScienceTUW run3_roberta-ct-multilingual 0.8264 0.4206 0.5574 DataScienceTUW run4-mdeberta 0.7481 0.3928 0.5151 DataScienceTUW run5-biobit-mlm 0.7922 0.3517 0.4871 DataScienceTUW run2_mdeberta-ct-multilingual 0.7433 0.3394 0.4661 DataScienceTUW run1_mdeberta-multilingual 0.5074 0.2094 0.2965 model7 , they present 5 runs of ensembled models, with some runs consisting on models fine-tuned only with the DisTEMIST dataset and others with DisTEMIST plus CardioCCC. Their best run is an ensemble of 17 systems trained on both corpora, which achieves the highest F1-score of the subtrack (0.8199). • Team Data Science TUW. This team uses four main strategies throughout their experiments for both subtracks: pre-training via MLM (Masked Language Modelling), data augmentation, sliding windows with overlap and additional pre-training on general diseases and drugs using other corpora. The pre-trained models they use include the multilingual mDeBERTa [36, 37], the Spanish “roberta-es-clinical-trials-ner”, 7 https://huggingface.co/lcampillos/roberta-es-clinical-trials-ner Table 10 General overview of the approaches presented by participants for the MultiCardioNER task. “*TEMIST corpora” refers to the joint version of the DisTEMIST, SympTEMIST, ProcTEMIST and DrugTEMIST corpora. Team Task Approaches Ensemble of RoBERTa models with multi-head CRF and differences in BIT.UA CardioDis the data used for training (only DisTEMIST or DisTEMIST + CardioCCC) Data Science Transformer-based models with different pretraining settings, data CardioDis TUW augmentation and window sliding Multilingual and language-specific Transformers with different MultiDrug pretraining settings, data augmentation and window sliding CLIN-X-ES model fine-tuned on the entire task data + custom clinical Enigma CardioDis dataset Multilingual and language-specific Transformers fine-tuned on the entire MultiDrug task data + custom drug dictionary Multilingual and language-specific BERT models with re-training, ICUE MultiDrug post-processing rules + GPT 3.5 RoBERTa model fine-tuned on the standalone DisTEMIST corpus vs. NOVALINCS CardioDis joint *TEMIST corpora RoBERTa model fine-tuned on the standalone DrugTEMIST corpus vs. MultiDrug joint *TEMIST corpora Ensemble of Transformer-based models trained on different datasets, PICUSLab CardioDis including an augmented version of CardioCCC + post-processing via string matching Siemens CardioDis Fine-tuned general domain BERT model MultiDrug Fine-tuned language-specific general domain BERT models the English “biobert_chemical_ner”8 and the Italian BioBIT [38]. An important note about this team’s results is that they had some problems with their submission that caused the overall low results. This is addressed in their system description, in which they re-evaluate their models with much better results, comparable to some of the task’s best. • Team Enigma. For subtrack CardioDis, team Enigma fine-tuned a CLIN-X-ES model [39] on the DisTEMIST and CardioCCC corpora for a different number of epochs. One of their runs further pre-trains the model using Spanish Wikipedia pages and datasets from different challenges, achieving them a spot in the subtrack’s top three F1-scores (0.8049). For subtrack MultiDrug, the team uses a combination of different models, including a multilingual XLM-RoBERTa [40] and language-specific models such as a Spanish RoBERTa [41] (which they also use for Italian) and BioLinkBERT for English [42]. Their first run, which uses the multilingual XLM- RoBERTa, pre-trains the model on a custom multi-lingual dataset (including biomedical challenge data, European drug description data, Wikipedia) and then fine-tuned for token classification on all data for all languages. For Italian, this approach achieves them the highest F1-score of the Italian part of the subtrack (0.8842). Their second run uses the same system but adds a classifier before it, which determines if there are any drugs in the sentence. For Spanish and English, their best run is the third one, which uses a language-specific model. This is not the case, however, for their third Italian run, which uses a Spanish model pre-trained on Italian data. Another interesting contribution by this team is the combination of neural systems and drug dictionaries obtained from resources such as DrugBank, ATC, DrugCentral or the NIHS. The two runs that use this approach achieve very good results, although not as good as their other ones. 8 https://huggingface.co/alvaroalon2/biobert_chemical_ner • Team ICUE. For the MultiDrug subtrack, this team compares the effectiveness of multilingual and monolingual BERT models. They also experiment with the inclusion of post-processing rules (specifically for composite drug mentions in Spanish), as well as with using Large Language Models (LLMs) such as GPT-3.5 [43] to translate predictions in Spanish to the other two languages. Their methodology achieves very good results, especially when they use monolingual models. In Spanish, they achieve the best F1-score (0.9277). It is noteworthy that some of their runs in the results table are repeated since they presented the same system with changes only for some languages. Team ICUE also includes some additional experiments in their system description paper, such as using GPT-3.5 and LLaMA [44] for entity recognition with competitive results. • Team NOVALINCS. For CardioDis, this team fine-tunes the “bsc-bio-ehr-es” pre-trained RoBERTa9 using the Dis- TEMIST corpus. They prepared two runs: one in which they only use the DisTEMIST annotations and another in which they also incorporate the other 3 entities from the complementary corpora (that is, procedures from MedProcNER/ProcTEMIST, symptoms from SympTEMIST and medica- tions from DrugTEMIST). For MultiDrug, they only participated in the Spanish part using the same methodology, exchanging DisTEMIST with DrugTEMIST. Their overall results for both tasks are remarkable for their high precision and low recall, which may indicate the difficulty of the systems to adapt to the cardiology subdomain using only the general clinical domain data. • Team PICUSLab. For the CardioDis subtrack, this team employs an ensemble transfer learning strategy. They train different models on DisTEMIST, CardioCCC and an augmented version of CardioCCC (created with the help of sentence similarity techniques and a gazetteer), and then fuse the predictions of the different models. To further improve their predictions, they use string matching to post- process them. Their best run earns them a spot in the subtrack’s top 5 with an F1-score of 0.791. • Team Siemens. This team participated in both CardioDis and MultiDrug with the same methodology. They use general domain BERT models (“bert-spanish-cased-finetuned-ner“10 , “bert-base-NER“11 and “bert-italian-finetuned-ner“12 ) and fine-tune them for multi-label token classification using the different MultiCardioNER datasets. Despite not using clinical models, their results are quite good, especially for the MultiDrug subtrack (e.g. 0.8789 F1-score in the Italian part). In their overview paper, they also perform additional experiments that were not evaluated during the task’s evaluation phase. 5. Discussion Comparison with previous tasks. MultiCardioNER is a novel task built upon the foundation of previous tasks and resources. In recent years, tasks such as DisTEMIST [18], PharmaCoNER [21] or MedProcNER [19] have provided the Spanish NLP community with a variety of corpora for the recognition (and normalization) of named entities in clinical texts. These corpora have progressively become reference corpora used to benchmark and model pre-training efforts [39, 45, 46, 47, 48, 49]. MultiCardioNER is different from these previous tasks in that it uses data from a single clinical specialty, rather than a general medical dataset. The CardioCCC corpus could become a reference for cardiology and subdomain adaptation in clinical NLP in Spanish. The corpus is expected to expand with the addition of case reports, more entity types (such as procedures and symptoms), and more languages. 9 https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es 10 https://huggingface.co/mrm8488/bert-spanish-cased-finetuned-ner 11 https://huggingface.co/dslim/bert-base-NER 12 https://huggingface.co/nickprock/bert-italian-finetuned-ner Subdomain adaptation is a major goal of MultiCardioNER. The task’s results indicate the importance of using subdomain data to build systems with specific application fields. All top-performing systems incorporate the released 258 documents from the CardioCCC corpus. In contrast, participants that only use the DisTEMIST and DrugTEMIST corpora (consisting of clinical case reports from various specialties) achieve high precision but fail to recall, thus obtaining a comparatively lower F1-score. This suggests that, while these systems are able to retrieve many clinical entities correctly (i.e. high precision), they fail to recover concepts specific to the cardiology subdomain (i.e. low recall). Furthermore, comparing the results of the DisTEMIST shared task [18] with the CardioDis subtrack, the overall results are somewhat better in the latter task: DisTEMIST’s winning team obtained an F1-score of 0.77, while the winning team of MultiCardioNER obtained an F1 of 0.81. This might point to the importance of using specialty-specific data, even within very similar clinical domains. We should underline that compared with DisTEMIST, this task offers a higher volume of training data. While there seems to be a positive correlation with the use of subdomain-specific data, it remains a question whether these improvements can actually be attributed to subdomain adaptation, to differences in each of the tasks’ test sets, or to simply having more data. Similarity between the general domain and the cardiology corpora. Given the task’s focus on subdomain adaptation, and in order to further characterise the cardiology and SpaCCC datasets (i.e. the DisTEMIST and DrugTEMIST texts) of the shared task, a comparison analysis was conducted between these clinical case reports and documents belonging to other medical disciplines. These documents consist of a collection of clinical cases categorised into 22 different specialities with varying text structures and content, including oncology, COVID-specific reports, primary health care, neurology, etc. The data for the other specialties was extracted using the same methodology as for the CardioCCC (cardiology) corpus (explained in Section 3.3). For the analysis, we tried to create a mathematical representation of the different document spe- cialties and their subsequent visualisation in a two-dimensional space. To this purpose, the document embeddings were extracted using the pre-trained language model “roberta-base-biomedical-clinical-es” (RoBERTa-based and trained on a large Spanish biomedical corpus from different sources), resulting in tensors of 𝑛 × 𝑚 dimensions, where 𝑛 is the number of sentences in the document and 𝑚 is the size of the language model (768 for the RoBERTa model). Subsequently, a vector composition technique was employed to process the extracted document embeddings, as described in the work of Amigó et al. [50]. This involved utilising the proposed generalised composition function in Amigó et al. [50] and illustrated in Equation 1. In this expression, the first component determines the vector direction of the sum of two vectors (𝑣⃗1 and 𝑣⃗2 ), while the second component represents its magnitude, which depends on the norm of single vectors and their inner product. By applying this function to pairs of successive sentences in a document and representing them as vectors, we are able to compute and represent each document as a single vector (embedding). In this study we implemented two different composition functions derived from Equation 1, the summation (𝐹𝑠𝑢𝑚 ), obtained when the constants 𝜆 and 𝜇 are equal to 1 and −2 respectively, and 𝐹𝑖𝑛𝑑 , a particularization of Equation 1 when 𝜆 is equals to 1 and 𝜇 to 0. 𝑣⃗1 + 𝑣⃗2 √︁ 𝐹𝜆,𝜇 (𝑣⃗1 , 𝑣⃗2 ) = · 𝜆(‖𝑣⃗1 2 ‖ + ‖𝑣⃗2 2 ‖) − 𝜇⟨𝑣⃗1 , 𝑣⃗2 ⟩ (1) ‖𝑣⃗1 + 𝑣⃗2 ‖ Following the document vector representation and the composition function technique, we imple- mented a t-Distributed Stochastic Neighbour Embedding (t-SNE) algorithm with a perplexity of 30 and a maximum number of iterations of 800 to reduce the dimensionality of the document embed- dings. This statistical method enables the visualisation of high-dimensional document embeddings in lower-dimensional spaces, in this case, two dimensions. Figures 4 and 5 illustrate the scatter plots generated by the applied methodology, utilising the two composition functions previously mentioned, 𝐹𝑠𝑢𝑚 and 𝐹𝑖𝑛𝑑 respectively. Both figures reveal distinct clustering patterns depending on the specialty. Documents belonging to specific specialties form a well-defined cluster (see cardiology i.e. CardioCCC in black), highlighting the fact that each of them Figure 4: Document embeddings representation per each discipline after reduction of their dimensionality to 2-dimensions by applying the t-SNE algorithm and using 𝐹𝑠𝑢𝑚 as the composition function. Figure 5: Document embeddings representation per each discipline after reduction of their dimensionality to 2-dimensions by applying the t-SNE algorithm and using 𝐹𝑖𝑛𝑑 as the composition function. possesses unique features in terms of content and structure. In contrast, documents from the SpaCCC corpus (red points) are scattered across the plot, reflecting their diverse nature. This is due to the fact that they cover a wide range of medical disciplines, such as cardiology (CardioCCC), oncology, urology, pneumology or infectious diseases, among many others. Future work and conclusions. There is a pressing need to promote the development of annotated datasets to generate automatic clinical concept detection tools, not only for a single language but for several languages, following comparable annotation criteria and consistent results across multiple languages. Due to the complexity and considerable workload associated with the manual corpus construction process of clinical content, the use of creative solutions such as neural translation and annotation projection strategies might provide an alternative solution to traditional corpus construction attempts. The results of the MultiCardioNER task indicate that it is feasible to create multilingual clinical corpora and use them to train and generate very competitive clinical NER systems with comparable results across several languages. Moreover, an adaptation of clinical NLP components to specific medical specialties can improve the quality of the resulting systems for real-world scenarios. Typically clinical NLP application scenarios or use cases focus on content related to a particular medical discipline, disease or patient type. In this regard, the MultiCardioNER task also provides useful insights on how to adapt general-purpose clinical NLP systems to the characteristics of a medical specialty of interest. We foresee that the results, resources, and strategies generated through the MultiCardioNER task (both by organizers and participants) might potentially promote also the creation of clinical NLP resources beyond the three chosen languages covered in this track. The MultiCardioNER silver standard corpus of predictions for Spanish, English and Italian could also constitute a valuable resource for data augmentation or corpus construction by manually validating the generated system predictions. The presented annotation projection strategy obviously relies on the sufficient quality of the used medical translation systems. Therefore, systematic efforts to evaluate the quality of neural medi- cal machine translation systems are critical. Initiatives like the Workshop on Machine Translation (WMT) Biomedical Translation shared task has provided insights on the quality and potential of neural translation technologies adapted to translate healthcare documents [51, 52]. Acknowledgments The MultiCardioNER track was funded by Spanish and European projects such as DataTools4Heart (Grant Agreement No. 101057849), AI4HF (Grant Agreement No. 101080430), BARITONE (Proyec- tos de Transición Ecológica y Transición Digital 2021. Expediente Nº TED2021-129974B-C21) and AI4ProfHealth (PID2020-119266RA-I00 MICIU/AEI/10.13039/501100011033). Google was a proud sponsor of the BioASQ Challenge in 2023. Ovid is also sponsoring this edition of BioASQ. The twelfth edition of BioASQ is also sponsored by Elsevier. Atypon Systems Inc. is also sponsoring this edition of BioASQ. References [1] A. Tariq, T. Santos, I. Banerjee, Natural language processing for cardiovascular applications, in: Artificial Intelligence in Cardiothoracic Imaging, Springer, 2022, pp. 231–243. [2] T. Nagamine, B. Gillette, A. Pakhomov, J. Kahoun, H. Mayer, R. Burghaus, J. Lippert, M. Saxena, Multiscale classification of heart failure phenotypes by unsupervised clustering of unstructured electronic medical record data, Scientific reports 10 (2020) 21340. [3] M. R. Turchioe, A. Volodarskiy, J. Pathak, D. N. Wright, J. E. Tcheng, D. Slotwiner, Systematic review of current natural language processing methods and applications in cardiology, Heart 108 (2022) 909–916. [4] M. J. Boonstra, D. Weissenbacher, J. H. Moore, G. Gonzalez-Hernandez, F. W. Asselbergs, Artificial intelligence: revolutionizing cardiology with large language models, European Heart Journal 45 (2024) 332–345. [5] R. Vijayakrishnan, S. R. Steinhubl, K. Ng, J. Sun, R. J. Byrd, Z. Daar, B. A. Williams, C. Defilippi, S. Ebadollahi, W. F. Stewart, Prevalence of heart failure signs and symptoms in a large primary care population identified through the use of text and data mining of the electronic health record, Journal of cardiac failure 20 (2014) 459–464. [6] S. Chae, J. Song, M. Ojo, M. Topaz, Identifying heart failure symptoms and poor self-management in home healthcare: a natural language processing study, in: Nurses and Midwives in the Digital Age, IOS Press, 2021, pp. 15–19. [7] S. Khurshid, C. Reeder, L. X. Harrington, P. Singh, G. Sarma, S. F. Friedman, P. Di Achille, N. Dia- mant, J. W. Cunningham, A. C. Turner, et al., Cohort design and natural language processing to reduce bias in electronic health records research, Npj Digital Medicine 5 (2022) 47. [8] O. V. Patterson, M. S. Freiberg, M. Skanderson, S. J. Fodeh, C. A. Brandt, S. L. DuVall, Unlocking echocardiogram measurements for heart disease research through natural language processing, BMC cardiovascular disorders 17 (2017) 1–11. [9] A. N. Berman, D. W. Biery, C. Ginder, O. L. Hulme, D. Marcusa, O. Leiva, W. Y. Wu, N. Cardin, J. Hainer, D. L. Bhatt, et al., Natural language processing for the assessment of cardiovascular disease comorbidities: The cardio-canary comorbidity project, Clinical Cardiology 44 (2021) 1296–1304. [10] J. R. Brown, I. M. Ricket, R. M. Reeves, R. U. Shah, C. A. Goodrich, G. Gobbel, M. E. Stabler, A. M. Perkins, F. Minter, K. C. Cox, et al., Information extraction from electronic health records to predict readmission following acute myocardial infarction: does natural language processing using clinical notes improve prediction of readmission?, Journal of the American Heart Association 11 (2022) e024198. [11] X. Zhan, M. Humbert-Droz, P. Mukherjee, O. Gevaert, Structuring clinical text with ai: Old versus new natural language processing techniques evaluated on eight common cardiovascular diseases, Patterns 2 (2021). [12] H. A. Alhakimi, T. E. Magzoub, Applications of natural language processing in cardiology using text clinical data: A systematic review, Advances in Clinical and Experimental Medicine 10 (2023). [13] R. Zhang, S. Ma, L. Shanahan, J. Munroe, S. Horn, S. Speedie, Automatic methods to extract new york heart association classification from clinical notes, in: 2017 ieee international conference on bioinformatics and biomedicine (bibm), IEEE, 2017, pp. 1296–1299. [14] R. Zhang, S. Ma, L. Shanahan, J. Munroe, S. Horn, S. Speedie, Discovering and identifying new york heart association classification from electronic health records, BMC medical informatics and decision making 18 (2018) 5–13. [15] P. Adejumo, P. Thangaraj, L. S. Dhingra, A. Aminorroaya, X. Zhou, C. Brandt, H. Xu, H. M. Krumholz, R. Khera, A deep learning approach for automated extraction of functional status and new york heart association class for heart failure patients during clinical encounters, medRxiv (2024). [16] P. Singhal, R. Walambe, S. Ramanna, K. Kotecha, Domain adaptation: challenges, methods, datasets, and applications, IEEE access 11 (2023) 6973–7020. [17] E. Laparra, A. Mascio, S. Velupillai, T. Miller, A review of recent work in transfer learning and domain adaptation for natural language processing of electronic health records, Yearbook of medical informatics 30 (2021) 239–244. [18] A. Miranda-Escalada, L. Gascó, S. Lima-López, E. Farré-Maduell, D. Estrada, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of DisTEMIST at BioASQ: Automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources (2022). [19] S. Lima-López, E. Farré-Maduell, L. Gascó, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of medprocner task on medical procedure detection and entity linking at bioasq 2023, in: Working Notes of CLEF 2023, 2023. [20] S. Lima-López, E. Farré-Maduell, L. Gasco-Sánchez, J. Rodríguez-Miret, M. Krallinger, Overview of SympTEMIST at BioCreative VIII: Corpus, Guidelines and Evaluation of Systems for the Detection and Normalization of Symptoms, Signs and Findings from Text, in: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models, 2023. [21] A. Gonzalez-Agirre, M. Marimon, A. Intxaurrondo, O. Rabal, M. Villegas, M. Krallinger, Pharma- coner: Pharmacological substances, compounds and proteins named entity recognition track, in: Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 2019, pp. 1–10. [22] A. Miranda-Escalada, E. Farré-Maduell, S. Lima-López, D. Estrada, L. Gascó, M. Krallinger, Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of livingner shared task and resources, Procesamiento del Lenguaje Natural (2022). [23] S. Lima-López, E. Farré-Maduell, A. Miranda-Escalada, V. Brivá-Iglesias, M. Krallinger, Nlp applied to occupational health: Meddoprof shared task at iberlef 2021 on automatic recognition, classification and normalization of professions and occupations from medical texts, Procesamiento del Lenguaje Natural 67 (2021) 243–256. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/ article/view/6393. [24] S. Lima-López, E. Farré-Maduell, V. Brivá-Escalada, L. Gascó, M. Krallinger, MEDDOPLACE Shared Task overview: recognition, normalization and classification of locations and patient movement in clinical texts, Procesamiento del Lenguaje Natural 71 (2023). [25] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, J. Tsujii, Brat: a web-based tool for nlp-assisted text annotation, in: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012, pp. 102–107. [26] A. Miranda-Escalada, E. Farré-Maduell, M. Krallinger, Named entity recognition, concept normal- ization and clinical coding: Overview of the cantemist track for cancer text mining in spanish, corpus, guidelines, methods and results, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings, 2020. [27] L. Gasco Sánchez, D. Estrada Zavala, E. Farré-Maduell, S. Lima-López, A. Miranda-Escalada, M. Krallinger, The SocialDisNER shared task on detection of disease mentions in health-relevant content from social media: methods, evaluation, guidelines and corpora, in: Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, Association for Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 182–189. URL: https://aclanthology.org/2022.smm4h-1.48. [28] R. Jonker, T. Almeida, S. Matos, BIT.UA at MultiCardioNER: Adapting a Multi-head CRF for Cardiology, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [29] P. Styll, L. Campillos-Llanos, W. Kusa, A. Hanbury, Cross-Linguistic Disease and Drug Detection in Cardiology Clinical Texts: Methods and Outcomes, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [30] A. Aksenova, A. Datseris, S. Vassileva, S. Boytcheva, Transformer-Based Disease and Drug Named Entity Recognition in Multilingual Clinical Texts: MultiCardioNER challenge, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [31] C. Lee, T. I. Simpson, J. M. Posma, A. D. Lain, Comparative Analyses of Multilingual Drug Entity Recognition Systems for Clinical Case Reports In Cardiology, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [32] R. Gonçalves, A. Lamúrias, Team NOVA LINCS @ BIOASQ12 MultiCardioNER Track: Entity Recognition with Additional Entity Types, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [33] A. Romano, G. Riccio, M. Postiglione, V. Moscato, Identifying Cardiological Disorders in Spanish via Data Augmentation and Fine-Tuned Language Models, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [34] M. D. Danu, V. G. Marica, C. Suciu, L. M. Itu, O. Farri, Multilingual Clinical NER for Diseases and Medications Recognition in Cardiology Texts using BERT Embeddings, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024. [35] R. A. A. Jonker, T. Almeida, R. Antunes, J. R. Almeida, S. Matos, Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes, Database (2024). [36] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum? id=XPZIaotutsD. [37] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, 2021. arXiv:2111.09543. [38] T. M. Buonocore, C. Crema, A. Redolfi, R. Bellazzi, E. Parimbelli, Localizing in-domain adaptation of transformer-based biomedical language models, Journal of Biomedical Informatics 144 (2023) 104431. [39] L. Lange, H. Adel, J. Strötgen, D. Klakow, Clin-x: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain, Bioinformatics 38 (2022) 3267–3274. [40] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116. [41] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained Biomedical Language Models for Clin- ical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language Pro- cessing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL: https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19. [42] M. Yasunaga, J. Leskovec, P. Liang, LinkBERT: Pretraining Language Models with Document Links, in: Association for Computational Linguistics (ACL), 2022. [43] OpenAI, Gpt-3.5 model, https://www.openai.com, 2023. [44] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971. [45] G. García Subies, Á. Barbero Jiménez, P. Martínez Fernández, A comparative analysis of spanish clinical encoder-based models on ner and classification tasks, Journal of the American Medical Informatics Association (2024) ocae054. [46] A. J. Tamayo Herrera, D. A. Burgos, A. Gelbukh, Clinical text mining in spanish enhanced by negationdetection and named entity recognition, Computación y Sistemas 27 (2023) 1169–1181. [47] H. Verma, S. Bergler, N. Tahaei, Comparing and combining some popular ner approaches on biomedical tasks, arXiv preprint arXiv:2305.19120 (2023). [48] A. V. Serrano, G. G. Subies, H. M. Zamorano, N. A. Garcia, D. Samy, D. B. Sanchez, A. M. Sandoval, M. G. Nieto, A. B. Jimenez, Rigoberta: a state-of-the-art language model for spanish, arXiv preprint arXiv:2205.10233 (2022). [49] F. Gallego, G. López-García, L. Gasco-Sánchez, M. Krallinger, F. J. Veredas, Clinlinker: Medical entity linking of clinical concept mentions in spanish, in: International Conference on Computational Science, Springer, 2024, pp. 266–280. [50] E. Amigó, A. Ariza-Casabona, V. Fresno, M. A. Martí, Information Theory–based Compositional Distributional Semantics, Computational Linguistics 48 (2022) 907–948. URL: https://doi.org/10. 1162/coli_a_00454. doi:10.1162/coli_a_00454. [51] R. Bawden, K. B. Cohen, C. Grozea, A. J. Yepes, M. Kittner, M. Krallinger, N. Mah, A. Neveol, M. Neves, F. Soares, et al., Findings of the wmt 2019 biomedical translation shared task: Evaluation for medline abstracts and biomedical terminologies, in: ACL 2019 Fourth Conference on Machine Translation, Association for Computational Linguistics, 2019, pp. 29–53. [52] M. Neves, A. J. Yepes, A. Siu, R. Roller, P. Thomas, M. V. Navarro, L. Yeganova, D. Wiemann, G. M. Di Nunzio, F. Vezzani, et al., Findings of the WMT 2022 biomedical translation shared task: Monolingual clinical case reports, in: WMT22-Seventh Conference on Machine Translation, 2022, pp. 694–723.