Enhancing Clinical Data Capture: Developing a Natural Language Processing Pipeline for Converting Free Text Admission Notes to Structured EHR Data

Patrick Styll1,*, Wojciech Kusa1 and Allan Hanbury1
1 Data Science Research Unit (E194-04), Technische Universität Wien, Favoritenstraße 9-11, 1040 Vienna, Austria

NL4AI 2024: Eighth Workshop on Natural Language for Artificial Intelligence, November 26-27th, 2024, Bolzano, Italy [1]
* Corresponding author.
Email: patrick.styll@tuwien.ac.at (P. Styll); wojciech.kusa@tuwien.ac.at (W. Kusa); allan.hanbury@tuwien.ac.at (A. Hanbury)
Web: https://github.com/Padraig20 (P. Styll); https://wojciechkusa.github.io/ (W. Kusa); https://informatics.tuwien.ac.at/people/allan-hanbury (A. Hanbury)
ORCID: 0000-0003-4420-4147 (W. Kusa); 0000-0002-7149-5843 (A. Hanbury)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Automating the extraction of essential patient information from clinical texts, such as admission notes, can significantly speed up the entry of this data into Electronic Health Records (EHRs), thereby improving workflow efficiency and supporting better patient care and healthcare management. To this end, we introduce a Natural Language Processing (NLP) pipeline designed to (i) automatically extract patient data via Named Entity Recognition (NER), (ii) normalize the extracted data to correspond to codes in official medical ontologies, and (iii) coerce the data into EHR format using Health Level 7's (HL7) Fast Healthcare Interoperability Resources (FHIR) standard. By adhering to these widely used standardized formats, the pipeline output can be immediately integrated into the Hospital Information System (HIS). To achieve this, we propose a newly labelled dataset comprising 255 notes from unlabelled datasets published by the Text Retrieval Conference's (TREC) Clinical Trials tracks. Finally, we utilize SapBERT for the normalization of the extracted entities and employ the FHIR standard as a basis to generate EHRs.

Keywords
Clinical Named Entity Recognition, SapBERT, FHIR, Electronic Health Records, ICD-10, NDC

1. Introduction
The increasing volume of clinical text data presents both challenges and opportunities for the healthcare sector [2]. Extracting meaningful information from these texts, such as personal patient data, is critical for applications in patient care, clinical research and healthcare management [3]. Admission notes, written by doctors when a new patient is admitted to the hospital, include essential patient details such as gender, age, and various medical conditions. Currently, doctors manually input this information into an Electronic Health Record (EHR), creating a bottleneck in the medical workflow.
To address this issue, we introduce a Natural Language Processing (NLP) pipeline designed to (i) automatically extract patient data via Named Entity Recognition (NER), (ii) normalize the extracted data to correspond to codes in official medical ontologies, and (iii) coerce the data into EHR format using the Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) standard. By adhering to these widely used standardized formats, the pipeline output can be immediately integrated into the Hospital Information System (HIS).
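To make these three steps concrete before the detailed discussion, the following Python sketch shows how such a pipeline could be composed. All function bodies are illustrative stubs with hypothetical names, standing in for the NER model (Section 3), the SapBERT matcher (Section 4.1) and the FHIR serializer (Section 4.2):

    from typing import Dict, List

    def extract_entities(note: str) -> List[Dict]:
        # (i) NER: in the real pipeline, the fine-tuned model of Section 3
        # produces these spans; a fixed example is returned here.
        return [{"text": "type 2 diabetes", "type": "MedicalCondition"}]

    def normalize_entity(entity: Dict) -> Dict:
        # (ii) Normalization: SapBERT (Section 4.1) maps the span to an
        # ICD-10/NDC code; E11 (type 2 diabetes mellitus) is one example.
        return {**entity, "system": "ICD-10", "code": "E11"}

    def to_fhir_bundle(entities: List[Dict]) -> Dict:
        # (iii) Coercion: wrap the normalized entities into a heavily
        # simplified FHIR Resource Bundle (Section 4.2).
        return {"resourceType": "Bundle", "type": "collection",
                "entry": [{"resource": e} for e in entities]}

    def process_admission_note(note: str) -> Dict:
        return to_fhir_bundle([normalize_entity(e)
                               for e in extract_entities(note)])

    print(process_admission_note("Patient with type 2 diabetes."))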
In Section 2, we present related work, while in Section 3 we introduce the first step of the pipeline, in which we extract relevant information, i.e., patient-related information, from admission notes. We state our specific goal and the evaluation metrics used. Furthermore, we propose a newly labelled dataset comprising 255 entries from unlabelled datasets published by the Text Retrieval Conference (TREC) Clinical Trials tracks. We show the origins of the data, justify our labelling techniques and present insights from our Exploratory Data Analysis (EDA). We train and fine-tune several Bidirectional Encoder Representations from Transformers (BERT) [4] models, evaluate their performance and explore different techniques to further enhance them. In Section 4, we handle both the second and third steps of the pipeline. First, we introduce the official medical ontologies we use: the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10) and the National Drug Code (NDC). We also explain the underlying technology for normalizing the extracted information via Self-Aligning Pretrained BERT (SapBERT) [5, 6]. Second, we describe how we used the HL7 FHIR standard to coerce all extracted and normalized information into an EHR. In Section 5, we discuss limitations and future work, and in Section 6 we conclude our research, summarize our findings and present a web interface showcasing the whole integrated workflow.

2. Related Work
For the first part of our pipeline, we largely rely on microsoft/mdeberta-v3-base [7, 8] as our baseline. This large, multilingual, general-domain model has recently gained recognition for its effectiveness in processing medical data, making it a suitable choice for medical NER. Furthermore, our participation [9] in the MultiCardioNER [10] task of the BioASQ [11] workshop at CLEF 2024 yielded valuable insights that we build on in this paper. That shared task focuses on the multilingual adaptation of clinical NER systems to the cardiology domain. It includes two key tasks: disease detection in Spanish texts and drug detection across Italian, Spanish, and English texts.
In the second step of our pipeline, we use Self-Aligning Pretrained BERT (SapBERT) [5, 6] for normalizing the extracted entities to standardized codes. SapBERT is a specialized version of the BERT model, developed specifically for biomedical and clinical text mining and designed to create high-quality embeddings of medical texts. To generate embeddings that are particularly well suited for biomedical applications, the model was exposed to large datasets of biomedical literature and clinical notes during training.
As for standardized codes, we use the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10) to classify medical conditions, symptoms and medical procedures. ICD-10 is a globally recognized medical classification system developed by the World Health Organization (WHO) and has become a critical tool for diagnosing and classifying a wide range of diseases and health conditions. For pharmaceuticals, we use the National Drug Code (NDC), a unique identifier largely used in the United States for drugs and other pharmaceutical products. Established by the Food and Drug Administration (FDA), the NDC serves as a universal product identifier for human drugs.
For coercing the output into an EHR, we use Fast Healthcare Interoperability Resources (FHIR) in the third step of the pipeline.
FHIR is a standard for exchanging healthcare information electronically, designed to enable interoperability between different healthcare systems. Since it is designed to be extensible, it allows developers to build custom applications and extensions without jeopardizing compatibility, which is a major reason why we use FHIR as the output format of our pipeline.

3. Named Entity Recognition for Patient Admission Notes
In this section, we introduce the initial phase of our clinical text processing pipeline, focusing on the extraction of crucial patient-related information from admission notes. We define our specific objectives and evaluation metrics, and we introduce and explore a newly labelled dataset. We then describe the training and fine-tuning of various BERT models, along with strategies for enhancing their performance.

3.1. Dataset Preparation and Exploratory Data Analysis
3.1.1. Data Collection
The primary dataset originates from the TREC CT/CDS topics, publicly accessible on the track's official website (http://trec-cds.org/). Each topic has a similar structure, including several diagnoses in free-text format. The topics represent admission notes containing the most important patient details, which a doctor records as soon as a person is admitted to a hospital. This includes personal information and demographics, such as gender and age, but also the current medical conditions, symptoms, medications/treatments and medical procedures. The dataset comprises a total of 255 entries (topics):
• TREC CDS 2016 [12] - each topic is split into three separate fields: note, description and summary. Since each field expresses the same information in different words, the fields are processed individually, creating a total of 90 topics.
• TREC CT 2021 [13] - 75 topics in total with one field.
• TREC CT 2022 [14] - 50 topics in total with one field.
• TREC CT 2023 [15] - preprocessed to free text in admission-note style via GPT-3; 40 topics in total. More details on the preprocessing can be found in [16].

3.1.2. Data Labelling and Analysis
For simplification purposes, we have decided to focus on four entity types, encompassing the most important information that has to be extracted from admission notes:
• Medical Condition - describes long-term conditions, such as diabetes mellitus or COVID-19.
• Symptom - in contrast to medical conditions, these are mostly short-term conditions that may be indicators of medical conditions; e.g., fever, a symptom, is an indicator of COVID-19, a medical condition.
• Medication/Treatment - describes either medicine (e.g., Ritalin) or treatment (e.g., rehab).
• Medical Procedure - includes both invasive and non-invasive procedures, such as tracheostomy or MR.
The labelling of the dataset was done with the open-source annotation tool doccano [17] by one of the authors; it is important to mention that this specific dataset has not been reviewed by domain experts. See Table 1 for a summary of the annotated data by entity type, which shows how imbalanced the entity-to-non-entity ratio is.

Table 1
Statistics summary for all entity types, showing the imbalanced nature of the dataset. Each count represents a token, i.e., a subword. The data adheres to the IOB-tagging format.

Entity Type    Medication   Symptom   Procedure   MedCond
B-count        370          900       231         1080
I-count        168          538       127         642

Total subwords in the dataset: 32942
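To illustrate the tagging format, consider a minimal, hypothetical example (word-level tokens are shown for readability, whereas the counts above refer to subwords; the exact label strings are an assumption for this sketch). B- marks the beginning of an entity, I- its continuation, and O a non-entity token:

    # Hypothetical IOB-labelled sentence: "fever" is a Symptom,
    # "type 2 diabetes mellitus" a Medical Condition, all else is O.
    tokens = ["Patient", "with", "fever", "and",
              "type", "2", "diabetes", "mellitus", "."]
    tags   = ["O", "O", "B-SYMPTOM", "O",
              "B-MEDCOND", "I-MEDCOND", "I-MEDCOND", "I-MEDCOND", "O"]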
There were several issues while labelling the data. The term medical condition is not entirely clear and is subject to interpretation. For instance, consider the relationship between medical conditions and symptoms: a fever is not a medical condition but a response to a medical condition or disease. The same holds for dysuria, which is the subsequent response to, e.g., UTIs (Urinary Tract Infections), a collection of various medical conditions. On the other hand, the question arises whether injuries can be seen as medical conditions. In fact, injuries, such as a broken arm, are not considered medical conditions; in medical language, injuries form a category of their own, which is, however, not included in this analysis.

3.2. Model Training and Evaluation
For evaluating the models, we used entity-level evaluation metrics [18], consistent with our previous submission to MultiCardioNER [9], specifically F1avg. Since we are working with a highly imbalanced dataset, entity-level evaluation provides a more accurate assessment of NER performance.
For the NER step of our pipeline, we decided to use microsoft/mdeberta-v3-base [7, 8] rather than models with fewer parameters, such as google-bert/bert-base-multilingual-cased [19], or specialized models, such as alvaroalon2/biobert_diseases_ner [20], since it was the most successful model in our previous experiments with medical NER [9].
For hyperparameter tuning, we used a 70:15:15 split. We observed that changes in certain parameters led to large differences in model performance. These include the learning rate, where high values (e.g., 0.1) led to worse results (F1avg of ≈ 0.8); very low values (e.g., 0.0001) also led to poor performance, suggesting that a certain balance is required. Similar behaviour can be observed for the batch size, where 16 appears to be the optimum. F1avg no longer changes significantly after about 10 epochs, and even drops, showing signs of overfitting on the training data. In the end, the parameters obtained from tuning and used for training are a batch size of 16, a learning rate of 0.01 and a model input size of 128, running with the SGD optimizer for 10 epochs. These parameters gave us a final validation F1avg of 85.6% and a training F1avg of 89.1%. The training history of the final model can be observed in Figure 1. Table 2 shows the optimized results for all entity types. Unfortunately, the metrics for surgical procedures are relatively low compared to the other entity types; this is largely due to the fact that surgical procedures are rather scarce (see Table 1) and exhibit a more diverse vocabulary. More data would be crucial to achieve better results.

Figure 1: Training for Medical Condition with optimized parameters.

3.2.1. Effect of Data Augmentation
As shown in Section 3.1.2, there are large imbalances in the ratio between entities and non-entities. To tackle this problem and increase model accuracy, we decided to even out these imbalances via data augmentation. Specifically, we shuffle the sentences, together with their respective entities, into random order and thus generate new model input, essentially doubling the amount of training data; a minimal sketch of this step is given after Table 2. The augmentation was performed only on the training set, while the validation and test sets were left unchanged. For sentence detection, we used spaCy [21]. Overall, this resulted in increased metric values, as can be seen in Table 2; bear in mind that these results originate from already fine-tuned models.

Table 2
Validation results of the tuned models for Medical Condition, Symptom, Medication/Treatment, and Surgical Procedure before and after data augmentation (i.e., shuffling of sentences).

Entity Type             F1avg (before)   F1avg (after)
Medical Condition       85.6%            90.8%
Symptom                 80.0%            83.2%
Medication/Treatment    75.8%            78.6%
Surgical Procedure      67.2%            71.4%
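A minimal sketch of the augmentation step, operating on raw text for brevity (in the actual setup, the IOB tags are permuted together with their sentences); spaCy's rule-based sentencizer is assumed here for sentence detection:

    import random
    import spacy

    # Lightweight pipeline that only performs sentence detection.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")

    def shuffle_sentences(note: str, seed: int = 0) -> str:
        # Split the note into sentences and reorder them randomly,
        # yielding a new, augmented training example.
        sentences = [sent.text for sent in nlp(note).sents]
        random.Random(seed).shuffle(sentences)
        return " ".join(sentences)

    train_notes = ["He reports fever. He has type 2 diabetes. He takes Ritalin."]
    # Appending the shuffled notes essentially doubles the training data.
    augmented = train_notes + [shuffle_sentences(n) for n in train_notes]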
4. Generation of Structured Electronic Health Records (EHRs)
In this section, we address both the second and third steps of our pipeline. We introduce the standard medical ontologies utilized in our work and dive into the technology used for normalizing the extracted information, specifically focusing on the application of SapBERT. Following this, we discuss how we used the HL7 FHIR standard to integrate all extracted and normalized data into an EHR.
For the entity types Medical Condition, Symptom and Medical Procedure, we use ICD-10 codes taken directly from the website of the Centers for Medicare & Medicaid Services (https://www.cms.gov/medicare/coding-billing/icd-10-codes); for Medication/Treatment, we use NDC codes. These were taken from the FDA's official website openFDA (https://open.fda.gov/data/ndc/), but had to be thoroughly preprocessed for use. We used both the proprietary and the non-proprietary name for matching the code. Furthermore, since each NDC code includes packaging information, which we do not extract from the text, we arbitrarily selected one code to represent each medicine. As a result, the packaging details associated with this code may not be accurate.

Figure 2: SapBERT integrated workflow. The raw data, i.e., the extracted entities and the medical classification codes, are transformed into feature space (embeddings). From there, they are matched via nearest-neighbour search with k = 1, based on a cosine similarity threshold.

4.1. Integration with SapBERT
The integration with SapBERT is required for medical entity normalization. To standardize the extracted entities, we need to connect each of them with its respective ICD-10/NDC code. In detail, we leverage SapBERT to create embeddings for the ICD-10 and NDC codes. The core steps of the workflow (see Figure 2) are generating embeddings for these codes, performing nearest-neighbour search, and determining the cosine similarity between embeddings to find the closest matches based on a pre-defined threshold. Based on experience from initial experiments, we chose a threshold of 0.4 for ICD-10 codes and 0.3 for NDC codes. A simplified sketch of this procedure is given below.
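The following sketch illustrates the matching step, assuming the publicly available SapBERT checkpoint on Hugging Face and a toy two-entry code list; the actual implementation operates on the full ICD-10/NDC vocabularies and may differ in detail:

    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Public SapBERT checkpoint; [CLS] embeddings serve as text representations.
    MODEL = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModel.from_pretrained(MODEL).eval()

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True,
                          return_tensors="pt")
        with torch.no_grad():
            cls = model(**batch).last_hidden_state[:, 0]  # [CLS] per text
        # L2-normalize so the dot product below equals cosine similarity.
        return torch.nn.functional.normalize(cls, dim=-1).numpy()

    # Toy stand-in for the ICD-10 feature space.
    codes = {"E11": "Type 2 diabetes mellitus", "R50": "Fever"}
    code_vectors = embed(list(codes.values()))

    def normalize(entity: str, threshold: float = 0.4):
        # Nearest neighbour (k=1) by cosine similarity; reject below threshold.
        sims = (code_vectors @ embed([entity]).T).ravel()
        best = int(np.argmax(sims))
        return list(codes)[best] if sims[best] >= threshold else None

    print(normalize("type II diabetes"))  # expected: "E11"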
4.2. Application of the FHIR Standard
The input to this part of the pipeline is (i) the text extracted by the NER model, (ii) the normalized entity and (iii) the corresponding ICD-10/NDC code. These triples are then grouped into the fitting FHIR resources, which represent specific types of clinical and administrative information in the FHIR standard. The HL7 organization offers a FHIR Resource Guide (https://www.hl7.org/fhir/resourceguide.html), which made it straightforward to find and use the appropriate resources. The FHIR resource templates for each entity type and an example of a FHIR Resource Bundle as a final EHR can be found in the GitHub repository; a minimal example is sketched below. It is interesting to note that the resource templates for Medical Condition and Symptom are identical, except for a note used to highlight the difference. This once more emphasizes the content-related overlap of these entity types, as described in Section 3.1.2.
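As a minimal illustration of this step, a Condition resource built from one such triple might look as follows. Field values are example data and the structure is heavily simplified; the actual templates in the repository are more complete:

    import json

    # One (extracted text, normalized entity, ICD-10 code) triple.
    extracted, normalized, code = ("type II diabetes",
                                   "Type 2 diabetes mellitus", "E11")

    # Simplified FHIR Condition resource; http://hl7.org/fhir/sid/icd-10
    # is the standard FHIR system URI for ICD-10 codes.
    condition = {
        "resourceType": "Condition",
        "code": {
            "coding": [{"system": "http://hl7.org/fhir/sid/icd-10",
                        "code": code, "display": normalized}],
            "text": extracted,  # original wording from the admission note
        },
        "subject": {"reference": "Patient/example"},
    }

    # All resources of a note are collected into one Resource Bundle (the EHR).
    bundle = {"resourceType": "Bundle", "type": "collection",
              "entry": [{"resource": condition}]}
    print(json.dumps(bundle, indent=2))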
5. Limitations and Future Work
One of the primary limitations of this work lies in the lack of a quantitative evaluation of the mapping methodology introduced in Section 4.1. While the approach of mapping extracted entities to standard codes shows promise, we do not provide a formal assessment of its performance. Future iterations of this study should aim to address this gap by introducing an appropriate evaluation framework, allowing for a stronger argument regarding the effectiveness of the mapping mechanism and its applicability in real-world scenarios.
Furthermore, the dataset used in this study, consisting of 32942 wordpieces (see Table 1), poses potential challenges for the generalizability of the findings. A dataset of this size, while sufficient for a proof of concept, may not capture the full complexity and variability present in larger, real-world datasets. Moreover, the dataset lacks expert intervention during the labelling process, which introduces the possibility of inaccuracies in entity extraction. In future work, incorporating expert validation for at least a subset of the data would enhance the quality and accuracy of the annotations, providing a more robust foundation for the entity extraction and mapping methods.

6. Conclusion
We have established an NLP pipeline for processing free-text admission notes into EHRs. In Section 3, we looked into the problem of medical NER: we selected an appropriate architecture, defined metrics for evaluation, described the data collection process, explored the data and justified our labelling decisions; finally, we trained models with various strategies and assessed their performance. In Section 4, we gave insights into medical classification systems such as ICD-10 and NDC, showcased how the output of Section 3 is normalized via SapBERT, and showed how the FHIR standard aids us in generating a standardized EHR.
Furthermore, we have created a web interface to showcase all three steps of the pipeline: (i) the extracted entities inside the admission note, (ii) the normalized entities, including the extracted text, the normalized text and the ICD-10/NDC code, and (iii) the automatically generated FHIR Resource Bundle representing a standardized EHR. All code can be found in the GitHub repositories Padraig20/Disease-Detection-NLP (medical NER) and Padraig20/EHR-Generator (EHR generation, including the SapBERT workflow and the web interface).

References
[1] G. Bonetta, C. D. Hromei, L. Siciliani, M. A. Stranisci, Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI), in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024), 2024.
[2] H. Dalianis, Clinical text mining: Secondary use of electronic patient records, Springer Nature, 2018.
[3] D. Demner-Fushman, N. Elhadad, C. Friedman, Natural language processing for health-related texts, in: Biomedical Informatics: Computer Applications in Health Care and Biomedicine, Springer, 2021, pp. 241–272.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.
[5] F. Liu, E. Shareghi, Z. Meng, M. Basaldella, N. Collier, Self-alignment pretraining for biomedical entity representations, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 4228–4238.
[6] F. Liu, I. Vulić, A. Korhonen, N. Collier, Learning domain-specialised representations for cross-lingual biomedical entity linking, in: Proceedings of ACL-IJCNLP 2021, 2021, pp. 565–574.
[7] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, 2021. arXiv:2111.09543.
[8] P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[9] P. Styll, L. Campillos-Llanos, W. Kusa, A. Hanbury, Cross-linguistic disease and drug detection in cardiology clinical texts: Methods and outcomes, in: CLEF 2024: Conference and Labs of the Evaluation Forum, Grenoble, France, 2024.
[10] S. Lima-López, E. Farré-Maduell, J. Rodríguez-Miret, M. Rodríguez-Ortega, L. Lilli, J. Lenkowicz, G. Ceroni, J. Kossoff, A. Shah, A. Nentidis, A. Krithara, G. Katsimpras, G. Paliouras, M. Krallinger, Overview of MultiCardioNER task at BioASQ 2024 on medical speciality and language adaptation of clinical NER systems for Spanish, English and Italian, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), CLEF Working Notes, 2024.
[11] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on large-scale biomedical semantic indexing and question answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[12] K. Roberts, D. Demner-Fushman, E. M. Voorhees, W. R. Hersh, Overview of the TREC 2016 clinical decision support track, in: Proceedings of the Text REtrieval Conference (TREC) 2016, National Institute of Standards and Technology (NIST), Gaithersburg, MD, 2016.
[13] I. Soboroff, Overview of TREC 2021, in: Proceedings of the Text REtrieval Conference (TREC) 2021, National Institute of Standards and Technology (NIST), Gaithersburg, MD, 2021.
[14] K. Roberts, D. Demner-Fushman, E. M. Voorhees, S. Bedrick, W. R. Hersh, Overview of the TREC 2022 clinical trials track, in: Proceedings of the Text REtrieval Conference (TREC) 2022, National Institute of Standards and Technology (NIST), Gaithersburg, MD, 2022.
[15] I. Soboroff, Overview of TREC 2023, in: Proceedings of the Text REtrieval Conference (TREC) 2023, National Institute of Standards and Technology (NIST), Gaithersburg, MD, 2023.
[16] W. Kusa, P. Styll, M. Seeliger, O. E. Mendoza, A. Hanbury, DoSSIER at TREC 2023 clinical trials track, in: Proceedings of the Text REtrieval Conference (TREC) 2023, National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 2023.
[17] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, X. Liang, doccano: Text annotation tool for humans, 2018.
URL: https://github.com/doccano/doccano.
[18] D. S. Batista, Named-entity evaluation metrics based on entity-level, 2018. URL: https://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/, accessed: 2024-05-21.
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[20] Á. Alonso Casero, Named entity recognition and normalization in biomedical literature: a practical case in SARS-CoV-2 literature, 2021. URL: https://oa.upm.es/67933/, unpublished.
[21] Explosion-AI, spaCy: Industrial-strength Natural Language Processing in Python, 2023. URL: https://spacy.io/ (sentence detection: https://spacy.io/usage/linguistic-features#sbd), version 3.0.