<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Embedding and Prompt-Driven Approaches for Named Entity Recognition, Entity Linking, and Clinical Code Prediction in Greek Discharge Summaries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Poojan Vachharajani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Netaji Subhas University of Technology</institution>
          ,
          <addr-line>New Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes our participation in the ELCardioCC shared task under the team name pjmathematician, focusing on Named Entity Recognition (NER), Entity Linking (EL), and Multi-Label Classification with Explainable AI (MLC-X) over Greek cardiology discharge letters. For the NER and EL subtasks, we employed various configurations based on large language models, including Qwen2.5-32B-Instruct and a fine-tuned LoRA variant, combined with prompt engineering strategies. Entities were extracted from unstructured Greek clinical text and semantically matched to ICD-10 codes using embeddings from the multilingual-e5-large-instruct model. In the MLC-X task, we leveraged a larger Qwen2.5-72B model, guided by a candidate list of codes generated via multilingual document embeddings. This system explored cross-lingual semantic similarity and prompt-tuned LLMs to perform entity-level and document-level clinical coding in a low-resource language setting.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Natural Language Processing</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Entity Linking</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>ICD-10 Coding</kwd>
        <kwd>Greek NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>
        The ELCardioCC task provided a specialized corpus of 1,000 Greek discharge letters from a cardiology
department for training and validation, and 500 letters for testing. These documents were annotated
with mentions (chief complaint, diagnosis, prior medical history, drugs, and cardiac echo) and their
corresponding ICD-10 codes. The dataset reflects the complexities of real-world clinical narratives,
including specialized terminology and abbreviations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A separate file, codes.csv, containing ICD-10
codes and their descriptions, and a labelset.txt file with the target ICD-10 codes for the task were also
provided. For our fine-tuning experiments, the 1,000-instance training set was split into a 700-instance
training set and a 300-instance validation set.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Our approach varied across the subtasks, primarily utilizing LLMs for NER and initial predictions, and
sentence embeddings for EL and supporting MLC-X.</p>
      <sec id="sec-3-1">
        <title>3.1. Named Entity Recognition (NER) - Subtask 1</title>
        <sec id="sec-3-1-1">
          <title>NER Configurations</title>
          <p>For NER, we experimented with three main configurations:</p>
          <p>
            • Config 1 &amp; 2 (Base LLM): These two configurations were identical experimental runs to assess the
stochasticity of the model’s output. Both used the base Qwen/Qwen2.5-32B-Instruct model [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
Inference was performed using LMDeploy [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] with a TurbomindEngineConfig and a generation
configuration set for sampling (top_p=0.8, temperature=0.8). A detailed, zero-shot prompt
(see Appendix A.1) instructed the LLM to translate the Greek text, detect entities,
provide an English translation for each, and explain its relevance in a structured JSON format.
• Config 3 (LoRA Fine-tuned LLM): This configuration used a LoRA [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] fine-tuned version
of Qwen/Qwen2.5-32B-Instruct-AWQ. Fine-tuning was performed with the LLaMA-Factory
framework [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] on the 700-instance training split, using a simpler prompt
focused on direct entity extraction (see Appendix A.2). Key hyperparameters included a learning
rate of 5e-06, 2.0 training epochs, a LoRA rank of 16, and a LoRA alpha of 32. The final submission
used the checkpoint from training step 20, chosen based on preliminary validation.
The JSON output from the LLMs was parsed with a custom function to retrieve the list of entity mentions.
We noted occasional JSON parsing errors, a practical challenge when using LLMs for structured data
generation; these were handled with fallback logic.
          </p>
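          <p>The parsing step described above can be sketched as follows. This is a minimal illustration, not the exact task code: the function name extract_entities, the &lt;JSON&gt;-tag output convention, and the regex fallback are our assumptions about one reasonable implementation.</p>

```python
import json
import re

def extract_entities(raw_output: str) -> list[str]:
    """Parse LLM output into a list of entity mentions, with fallback logic.

    Illustrative sketch: assumes the model was asked to wrap its answer in
    <JSON>...</JSON> tags and emit objects carrying an "entity" field.
    """
    # First try: pull the block between <JSON> tags and parse it strictly.
    match = re.search(r"<JSON>(.*?)</JSON>", raw_output, re.DOTALL)
    candidate = match.group(1) if match else raw_output
    try:
        data = json.loads(candidate)
        return [item["entity"] for item in data if "entity" in item]
    except (json.JSONDecodeError, TypeError):
        pass
    # Fallback: regex-scrape "entity" values out of malformed JSON.
    return re.findall(r'"entity"\s*:\s*"([^"]+)"', raw_output)

good = '<JSON>[{"entity": "στηθάγχη"}, {"entity": "δύσπνοια"}]</JSON>'
bad = 'Sure! Here you go: {"entity": "στηθάγχη"}, {"entity": "δύσπνοια"},'
print(extract_entities(good))  # → ['στηθάγχη', 'δύσπνοια']
print(extract_entities(bad))   # → ['στηθάγχη', 'δύσπνοια']
```

          <p>The fallback keeps the pipeline running on truncated or chatty model outputs at the cost of ignoring any structure beyond the "entity" key.</p>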
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Entity Linking (EL) - Subtask 2</title>
        <p>
          The EL subtask built upon the entities recognized in Subtask 1. For each extracted Greek entity, we
linked it to the most appropriate ICD-10 code from labelset.txt using semantic similarity with the
intfloat/multilingual-e5-large-instruct sentence transformer model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>1. Corpus Preparation: An ICD-10 code corpus was created by combining the 3-character codes
from labelset.txt with their detailed English descriptions from codes.csv. Embeddings were
pre-computed for each code’s description.
2. Query Embedding: For each entity extracted by the LLM, a query embedding was generated
using an instructional format. If a contextual relevance description was available (from Configs 1
&amp; 2), the prompt was: "Instruct: Given an entity, retrieve the related medical Disease\nQuery:
[entity_relevance_english]". If only the Greek entity was available, the prompt was: "Instruct:
Given a Greek entity, retrieve the related medical Disease\nQuery: [greek_entity]". This
dual-prompting strategy aimed to leverage the richer context when available.
3. Similarity Matching: The cosine similarity between the query embedding and all ICD-10 code
embeddings was calculated, and the ICD-10 code with the highest similarity was assigned.
The three EL configurations (config1, config2, config3) directly corresponded to the NER
configurations, using their respective entity outputs.</p>
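        <p>Step 3 reduces to a matrix-vector similarity followed by an argmax. The sketch below uses toy 3-dimensional vectors in place of the real 1024-dimensional e5 embeddings; the function name, example codes, and vector values are illustrative assumptions.</p>

```python
import numpy as np

def link_entity(query_emb: np.ndarray, code_embs: np.ndarray, codes: list[str]) -> str:
    """Assign the ICD-10 code whose description embedding is most
    cosine-similar to the query embedding (one row of code_embs per code)."""
    q = query_emb / np.linalg.norm(query_emb)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity, since both sides are L2-normalized
    return codes[int(np.argmax(sims))]

# Toy stand-ins for the pre-computed description embeddings (the real ones
# come from intfloat/multilingual-e5-large-instruct).
codes = ["I20", "I50", "E11"]
code_embs = np.array([[1.0, 0.1, 0.0],
                      [0.0, 1.0, 0.2],
                      [0.1, 0.0, 1.0]])
query = np.array([0.9, 0.2, 0.1])  # e.g. the embedding of an angina mention
print(link_entity(query, code_embs, codes))  # → I20
```

        <p>Because every query is compared against all pre-computed code embeddings, the per-entity cost is a single matrix product over the label set.</p>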
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multi-Label Classification with Explainable AI (MLC-X) - Subtask 3</title>
        <p>For the MLC-X subtask, we used a larger, more capable model, Qwen/Qwen2.5-72B-Instruct-AWQ,
believing its enhanced reasoning abilities would be beneficial for this complex document-level task. A
two-stage approach was adopted:
1. Candidate Generation: To reduce the search space and guide the LLM, a candidate list of
ICD-10 codes was generated for each document. We created embeddings for the original Greek
text and its two different English translations (produced by our Config 1 and 2 systems) using
multilingual-e5-large-instruct. For each of the three texts, we found the top 20 most similar
ICD-10 codes from our corpus. The union of these three sets formed the final candidate list for
the LLM.
2. LLM-based Classification and Explanation: The 72B model was prompted with the Greek
text and the filtered list of candidate codes and their descriptions. The prompt (see Appendix A.3)
instructed the model to select the relevant codes and extract the exact Greek phrases justifying
each selection.</p>
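        <p>The candidate-generation stage is a top-k retrieval per text followed by a set union. A minimal sketch with toy similarity scores (k reduced from the task’s 20 to 2 so the example stays small; codes and scores are illustrative):</p>

```python
import numpy as np

def top_k_codes(sims: np.ndarray, codes: list[str], k: int) -> set[str]:
    """Return the codes with the k highest similarity scores."""
    idx = np.argsort(sims)[::-1][:k]
    return {codes[i] for i in idx}

def candidate_list(sims_per_text: list[np.ndarray], codes: list[str], k: int) -> set[str]:
    """Union of per-text top-k retrievals: one similarity vector for the Greek
    original and one for each English translation (the task used k=20)."""
    out: set[str] = set()
    for sims in sims_per_text:
        out |= top_k_codes(sims, codes, k)
    return out

codes = ["I20", "I50", "E11", "J45"]
greek = np.array([0.9, 0.8, 0.1, 0.2])    # toy scores, not real model output
trans1 = np.array([0.7, 0.2, 0.9, 0.1])
trans2 = np.array([0.6, 0.1, 0.2, 0.95])
print(sorted(candidate_list([greek, trans1, trans2], codes, k=2)))
# → ['E11', 'I20', 'I50', 'J45']
```

        <p>Taking the union over the original and both translations hedges against a single noisy translation dropping a relevant code from the shortlist.</p>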
        <sec id="sec-3-3-1">
          <title>Submitted Configurations</title>
          <p>Two final configurations were submitted:</p>
          <p>• Config 4 (MLC with Evidence): The direct output of the 72B LLM, including both the predicted
ICD-10 codes and their supporting textual evidence.
• Config 5 (MLC no Evidence): From the same LLM output as Config 4, we extracted only the
predicted ICD-10 codes and submitted them with mention positions set to -1, per the task guidelines
for a classification-only submission.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>The official evaluation was performed on the test set of 500 discharge letters.</p>
      <sec id="sec-4-1">
        <title>4.1. NER and EL Results</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. MLC-X Results</title>
        <p>Table 3 presents our official results for Subtask 3a. For MLC-X, Config 5 (codes only) achieved a higher
F1-score than Config 4 (codes with evidence). This is likely because the Subtask 3a evaluation metric
penalizes incorrect mention boundaries; by not providing them, Config 5 avoids this penalty,
resulting in a score that purely reflects code classification accuracy. The stronger performance on
this subtask compared to EL suggests that the two-stage approach, using embeddings for candidate
filtering and a larger LLM for final classification, is a more effective strategy for document-level coding.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Our experiments highlighted several challenges and insights. The primary challenge was the inherent
difficulty of clinical NLP in a low-resource language. The performance of EL was critically dependent
on the upstream NER task; any errors in entity recognition propagated directly, limiting the potential
of the linking module.</p>
      <p>The use of different LLM sizes was a conscious design choice. The 32B model was deemed sufficient
for the more straightforward (though still challenging) NER task, while the larger 72B model was
reserved for the more complex MLC-X reasoning task, which involved selecting from a candidate list
and finding evidence. Our results suggest this was a reasonable approach, as the MLC-X F1-scores were
considerably higher than the EL scores.</p>
      <p>The LoRA fine-tuning experiment (Config 3) yielded interesting results. While it did not outperform
the zero-shot base model, it demonstrated a trade-off between precision and recall. With only two
training epochs and a small dataset (700 examples), the model may have been under-tuned. Further
training or more sophisticated prompt-tuning could potentially improve its performance.</p>
      <p>Finally, a practical challenge was the reliability of LLMs in generating perfectly formatted JSON,
with several parsing errors encountered during our experiments. This underscores the need for robust
post-processing and error-handling when integrating LLMs into structured data extraction pipelines.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We presented our systems for the ELCardioCC shared task, leveraging a combination of large language
models (Qwen2.5-32B and Qwen2.5-72B), LoRA fine-tuning, and multilingual sentence embeddings.
Our approach demonstrated the feasibility of using modern NLP techniques for NER, EL, and MLC-X
on Greek clinical text. The results indicate that while zero-shot prompting of capable LLMs provides a
strong baseline, the performance of downstream tasks like EL is highly sensitive to NER quality. For
document-level classification, a hybrid approach combining semantic retrieval for candidate generation
with a powerful LLM for final classification and explanation proved to be the most effective strategy.
Future work could explore joint NER and EL models to mitigate error propagation, as well as more extensive
fine-tuning to better adapt models to this specific clinical domain.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the organizers of the ELCardioCC shared task for providing the dataset and the evaluation
platform. We also acknowledge the developers of the open-source tools and models used in this work,
including Hugging Face Transformers, Sentence Transformers, LMDeploy, and LLaMA-Factory.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini 2.5 Pro for grammar and spelling
checking. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Prompts Used in Experiments</title>
      <sec id="sec-9-1">
        <title>A.1. Prompt for Base LLM (Config 1 &amp; 2)</title>
        <p>This prompt was used for the zero-shot NER and EL tasks with the base Qwen/Qwen2.5-32B-Instruct
model. Its output specification asks for a list of JSON objects wrapped in &lt;JSON&gt; tags, each containing,
among other fields, an "entity" field holding the extracted entity exactly as it appears in the Greek text
(same case).</p>
      </sec>
      <sec id="sec-9-2">
        <title>A.2. Prompt for LoRA Fine-Tuning and Inference (Config 3)</title>
        <p>This simpler system prompt was used for fine-tuning and inference with the LoRA-adapted model,
focusing directly on entity extraction.</p>
      </sec>
      <sec id="sec-9-3">
        <title>A.3. Prompt for MLC-X Task (Config 4 &amp; 5)</title>
        <p>This prompt was used with the Qwen/Qwen2.5-72B-Instruct-AWQ model for the multi-label
classification task. The {} placeholders in the user prompt were populated with the Greek text and the
candidate ICD-10 codes, respectively.</p>
        <p># Medical Coding Task: Greek Text to ICD10 Classification
## Your Task
You will analyze Greek medical texts and identify relevant ICD10 codes from a provided set. For each
relevant code, extract the specific Greek terms/phrases from the text that justify this classification.
2. Analyze each provided ICD10 code and determine if it applies to the medical text
3. For each relevant ICD10 code, identify and extract the exact Greek terms/phrases that support this
code assignment
4. Only include ICD10 codes that are clearly supported by the text
5. Extract entities exactly as they appear in the Greek text (preserve exact spelling, accents, and form)
6. If no codes are relevant, return an empty ICD10 array
## Output Format
Respond in valid JSON format with this exact structure:
&lt;JSON&gt;
{
  "ICD10": [
    {
      "code": "A25",
      "entities": ["exact greek phrase 1", "exact greek phrase 2"]
    }
  ]
}
&lt;/JSON&gt;</p>
        <p>USER_PROMPT = """
## Greek Medical Text
{}
## ICD10 Codes
{}
"""</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannakoulas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samaras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitriadis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patsiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bekiaridou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>ELCardioCC: Advancing Clinical Coding in Cardiology: A Challenge on Named Entity Recognition, Entity Linking, Multi-label Classification &amp; Explainable AI. Task Overview</article-title>
          . https://elcardiocc.web.auth.gr/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>ELCardioCC</given-names>
            <surname>Shared Task Organizers</surname>
          </string-name>
          . Dataset Description.
          <source>ELCardioCC Website</source>
          . https://elcardiocc. web.auth.gr/#dataset
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Qwen</given-names>
            <surname>Team</surname>
          </string-name>
          .
          <source>Qwen2.5 Technical Report</source>
          . https://qwenlm.github.io/blog/qwen1.5/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>LMDeploy</given-names>
            <surname>Contributors</surname>
          </string-name>
          .
          <article-title>LMDeploy: A High-throughput LLM Inference Engine</article-title>
          .
          <source>GitHub Repository</source>
          . https://github.com/InternLM/lmdeploy
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allen-Zhu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , W. LoRA:
          <article-title>Low-Rank Adaptation of Large Language Models</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>LLaMA-Factory Contributors</surname>
          </string-name>
          .
          <article-title>LLaMA Factory: Unified LLaMA Fine-tuning Framework</article-title>
          .
          <source>GitHub Repository</source>
          . https://github.com/hiyouga/LLaMA-Factory
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          .
          <article-title>Text Embeddings by Weakly-Supervised Contrastive Pre-training</article-title>
          .
          <source>arXiv preprint arXiv:2212.03533</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Dimitriadis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patsiou</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoikopoulou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toumpas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kipouros</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadopoulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bekiaridou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barmpagiannos</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasilopoulou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barmpagiannos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samaras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannakoulas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Tsoumakas</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Overview of ElCardioCC Task on Clinical Coding in Cardiology at BioASQ 2025</article-title>
          . In CLEF 2025 Working Notes, edited by Guglielmo Faggioli, Nicola Ferro, Paolo Rosso, and Damiano Spina,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>